ControlNet and structural conditioning in diffusion models

TL;DR: ControlNet adds spatial control to diffusion models by attaching a trainable copy of the model’s encoder that processes control signals (edges, poses, depth) and connects to the locked original through zero-initialized convolutions, preserving the original model’s knowledge while letting the control image guide the layout of the generated image [1].

Architecture

ControlNet uses a dual-copy design: the original pretrained diffusion model stays locked to preserve its learned features, while a parallel trainable copy of its encoder and middle blocks learns task-specific spatial conditioning [2]. Zero convolutions (1×1 convolutions whose weights and biases are initialized to zero) connect the two copies and inject the learned control features back into the backbone [1]. Because these layers output zero at the start of training, the combined network initially behaves exactly like the pretrained model, which avoids the “catastrophic forgetting” typical of fine-tuning large models directly.
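The effect of zero initialization can be checked with a small sketch. It uses plain Python lists instead of tensors, and `block` is a hypothetical stand-in for a network block, not a real U-Net layer. The wiring assumed here is: locked output plus a zero convolution applied to the trainable copy's output, where the copy receives the input plus a zero-convolved condition.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution: mixes channels independently at every pixel.
    x is indexed [channel][row][col]; weight is [out][in]; bias is [out]."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][i] * x[i][r][c] for i in range(len(x)))
                for c in range(width)
            ]
            for r in range(height)
        ]
        for o in range(len(weight))
    ]


def zero_conv(channels):
    """Weights and bias of a zero convolution: everything starts at 0."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels


def add(a, b):
    """Element-wise sum of two [channel][row][col] feature maps."""
    return [
        [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
        for ca, cb in zip(a, b)
    ]


def block(x, gain=2.0):
    """Hypothetical stand-in for a network block (assumption, illustration only)."""
    return [[[gain * v + 1.0 for v in row] for row in ch] for ch in x]


def controlled_block(x, cond, z_in, z_out):
    """Locked block plus a trainable copy, bridged by two 1x1 convolutions."""
    locked = block(x)                                # frozen pretrained path
    branch_in = add(x, conv1x1(cond, *z_in))         # condition enters the copy
    branch_out = conv1x1(block(branch_in), *z_out)   # copy output, re-projected
    return add(locked, branch_out)


x = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]     # 2 channels, 2x2
cond = [[[0.0, 1.0], [1.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]]  # e.g. an edge map

# At initialization both bridges are zero, so the control branch adds nothing.
at_init = controlled_block(x, cond, zero_conv(2), zero_conv(2))

# After training the bridges are no longer zero; identity weights mimic that.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
after_training = controlled_block(x, cond, identity, identity)
```

In this sketch `at_init` equals `block(x)` exactly, while `after_training` differs from it. That is the property that lets training start from the pretrained behavior instead of from a perturbed one.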

How Structural Conditioning Works

Users provide a control signal (a Canny edge map, human pose skeleton, depth map, or segmentation mask), which the trainable branch encodes as spatial features. These features are added to the diffusion model’s hidden states at multiple layers, steering generation toward structures that match the control input while leaving visual details (texture, style, color) to the text prompt and the sampled noise [4].
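A minimal sketch of the fusion step, assuming the control features arrive as one residual per layer with the same shape as that layer's hidden state. The `fuse` helper and its `scale` parameter are hypothetical names used for illustration, and feature maps are flattened to 1-D lists to keep the example short.

```python
def fuse(hidden_states, control_residuals, scale=1.0):
    """Add scaled control residuals to the matching hidden states, layer by layer."""
    if len(hidden_states) != len(control_residuals):
        raise ValueError("expected one residual per hidden state")
    fused = []
    for h, r in zip(hidden_states, control_residuals):
        if len(h) != len(r):
            raise ValueError("residual shape must match hidden state shape")
        fused.append([hv + scale * rv for hv, rv in zip(h, r)])
    return fused


# Toy data: two layers at different resolutions (assumed values).
hidden = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0]]
residuals = [[0.25, 0.0, -0.5, 0.75], [0.5, -0.5]]

guided = fuse(hidden, residuals, scale=1.0)    # full control strength
unguided = fuse(hidden, residuals, scale=0.0)  # control switched off
```

With `scale=0.0` the hidden states pass through unchanged, so a scale of this kind acts as a dial between the unconditioned model and full structural guidance.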

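Training exhibits “sudden convergence”: because the zero convolutions add no noise to the backbone, the model keeps producing high-quality images throughout training, but it ignores the control input at first and then abruptly starts following it, usually within about 10,000 optimization steps [2].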

Supported Control Types

ControlNet supports nine conditioning modalities: Canny edges, M-LSD line detection, HED boundary maps, user scribbles, OpenPose human pose skeletons, semantic segmentation, depth maps, normal maps, and cartoon line drawings [3]. Multiple ControlNets can be combined to apply several conditions at once.
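One way to picture multi-condition control is as a weighted sum of the residuals that each ControlNet produces. The sketch below assumes that combination rule, which is one common approach and not necessarily the only one. The `compose_residuals` helper and the toy values are hypothetical.

```python
def compose_residuals(per_controlnet_residuals, weights):
    """Weighted sum of residuals from several ControlNets, per layer and element."""
    if len(per_controlnet_residuals) != len(weights):
        raise ValueError("expected one weight per ControlNet")
    n_layers = len(per_controlnet_residuals[0])
    combined = []
    for layer in range(n_layers):
        size = len(per_controlnet_residuals[0][layer])
        combined.append([
            sum(w * res[layer][i] for res, w in zip(per_controlnet_residuals, weights))
            for i in range(size)
        ])
    return combined


# Two ControlNets (say, pose and depth), each emitting residuals for two layers.
pose_residuals = [[1.0, 0.0], [2.0]]
depth_residuals = [[0.0, 4.0], [-2.0]]

combined = compose_residuals([pose_residuals, depth_residuals], weights=[1.0, 0.5])
```

Lowering one weight reduces that condition's influence without retraining either ControlNet.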

Advantages

Model preservation: Because the base model stays locked, training works even on small datasets (<50k images) and does not degrade the base model [1].

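Direct spatial control: Classifier-free guidance only scales how strongly sampling follows the prompt, whereas ControlNet’s trainable branch injects spatial constraints directly into the denoising network at every step. Sampling itself remains stochastic, so different seeds give different images that share the same structure.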

Transfer learning: Trained ControlNets can be reused across related tasks and with community models built on the same base model, for example through the A1111 WebUI (see stable-diffusion-art.com) [3].