Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

1 AIRI, Moscow
2 HSE University

Abstract

Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. One popular approach to this task is ControlNet, which adds an auxiliary conditioning module to the architecture. To improve the alignment between the generated image and the control signal, ControlNet++ introduces a cycle consistency loss that refines the correspondence between controls and outputs, but applies it only at the final denoising steps, even though the main image structure emerges at the early stages of generation. To address this issue, we propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Specifically, we train lightweight control prediction probes (small convolutional networks) to reconstruct the input control signal (e.g., edges, depth) from intermediate UNet features at every denoising step. We show that such probes can extract control signals even from very noisy latents and use them to generate pseudo ground-truth controls during training. This enables an alignment loss that minimizes the discrepancy between the predicted and target condition throughout the entire diffusion process. Our experiments demonstrate that the method improves both control alignment and generation fidelity. By combining this loss with established training techniques such as ControlNet++, we achieve strong performance across different conditioning types, including edge and depth maps.
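
For intuition, a control prediction probe of this kind can be sketched as a small convolutional network on top of UNet features. The sketch below is our illustration, not the released implementation; the layer widths, feat_dim, and the bilinear upsampling are assumptions.

    # A minimal sketch (our illustration, not the paper's released code) of a
    # lightweight convolutional probe that reconstructs a spatial control
    # signal (e.g., a depth or edge map) from intermediate UNet features.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ControlProbe(nn.Module):
        """Predicts a control map (depth, edges, ...) from UNet features."""

        def __init__(self, feat_dim: int = 1280, out_channels: int = 1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv2d(256, 128, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv2d(128, out_channels, kernel_size=1),
            )

        def forward(self, feats: torch.Tensor, out_size) -> torch.Tensor:
            # Map features to a control prediction, then upsample the
            # low-resolution prediction to the control-map resolution.
            pred = self.net(feats)
            return F.interpolate(pred, size=out_size, mode="bilinear",
                                 align_corners=False)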

Pipeline overview


Pipeline overview. We schematically illustrate the main idea of our InnerControl framework, highlighting the integration of the alignment loss. The key difference from ControlNet++ is that InnerControl operates on intermediate features extracted from the UNet decoder. These features are passed through an aggregation network that predicts spatial control signals (e.g., depth or edge maps), which are then compared to the input control \( c_{\text{spatial}} \) to enforce consistency at every denoising step.
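
Concretely, the per-step alignment loss can be sketched as below. The concatenation-based aggregate_features and the MSE penalty are stand-ins we assume for illustration (the paper's aggregation network and loss may differ); ControlProbe refers to the probe sketched above.

    # A hedged sketch of the per-step alignment loss: UNet decoder features
    # from a sampled denoising step are aggregated, decoded into a control
    # prediction, and compared to the input control c_spatial.
    import torch
    import torch.nn.functional as F

    def aggregate_features(decoder_feats):
        # Resize every decoder feature map to a common resolution and
        # concatenate channel-wise (a simple stand-in for the aggregation
        # network shown in the pipeline figure).
        h, w = decoder_feats[-1].shape[-2:]
        resized = [F.interpolate(f, size=(h, w), mode="bilinear",
                                 align_corners=False) for f in decoder_feats]
        return torch.cat(resized, dim=1)

    def alignment_loss(decoder_feats, probe, c_spatial):
        feats = aggregate_features(decoder_feats)
        pred = probe(feats, out_size=c_spatial.shape[-2:])
        # Penalize disagreement between the predicted and the input
        # control at this denoising step.
        return F.mse_loss(pred, c_spatial)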

Quantitative Results


Unified comparison on the MultiGen-20M benchmark. Controllability is measured by SSIM (↑) for HED/LineArt and RMSE (↓) for Depth; fidelity by FID (↓); text relevance by CLIP score (↑). * denotes a model trained from scratch using the hyperparameters suggested in the paper.

Qualitative Results


Visualization of the difference between the control signal extracted from intermediate features and the input control, after our training is applied (top) and for standard ControlNet (bottom).

More Visualization Results


BibTeX

If you find our work useful in your research, please consider citing:

@misc{konovalova2025heedinginnervoicealigning,
      title={Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback},
      author={Nina Konovalova and Maxim Nikolaev and Andrey Kuznetsov and Aibek Alanov},
      year={2025},
      eprint={2507.02321},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.02321},
}