Multimodal Conditioning and Control
Text encoders, cross-attention, CLIP/T5-style conditioning, ControlNet spatial controls, adapters and reference conditioning.
What a candidate should be able to do
- Design conditioning APIs for text, image, pose, depth, mask, audio or reference inputs.
- Distinguish global text guidance from spatial controls.
- Explain ControlNet-style side branches without claiming they solve every alignment problem.
- Identify failures in prompt following, spatial relations and counting.
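To make the ControlNet point above concrete, here is a minimal sketch of the side-branch idea: a trainable copy of a base block whose output passes through a zero-initialized projection ("zero conv"), so at initialization the conditioned model is exactly the base model. All names (`base_block`, `controlnet_branch`) and the use of plain linear layers instead of UNet blocks are illustrative assumptions, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_block(x, w):
    # Frozen base-model block: a plain linear layer standing in for a UNet block.
    return x @ w

def controlnet_branch(x, control, w_copy, w_zero):
    # Trainable copy of the base block plus a zero-initialized output
    # projection ("zero conv"): at initialization the branch contributes
    # nothing, so training starts from the unmodified base model.
    h = (x + control) @ w_copy
    return h @ w_zero

d = 8
x = rng.normal(size=(2, d))        # latent features
control = rng.normal(size=(2, d))  # encoded spatial control (e.g. pose/depth map)
w_base = rng.normal(size=(d, d))
w_copy = w_base.copy()             # branch initialized as a copy of base weights
w_zero = np.zeros((d, d))          # zero-initialized projection

out = base_block(x, w_base) + controlnet_branch(x, control, w_copy, w_zero)
# With w_zero all zeros, the sum equals the unconditioned base output.
print(np.allclose(out, base_block(x, w_base)))  # True
```

This also shows why the side branch does not solve every alignment problem: it only adds a residual signal; the base model's text understanding is untouched.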
What interviewers ask
- How would you add pose conditioning without retraining the whole base model?
- Why do text encoders affect typography and prompt following?
- What happens if conditioning signals conflict?
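One way to reason about the "conflicting signals" question is to view each condition as its own guidance direction, blended linearly in the spirit of classifier-free guidance. The function below is a hypothetical sketch, not a real sampler: when two directions oppose each other, raising one scale visibly cancels the other.

```python
import numpy as np

def blend_guidance(eps_uncond, eps_text, eps_ctrl, s_text=1.0, s_ctrl=1.0):
    # Hypothetical linear blend of two guidance directions: each condition
    # pushes the noise prediction along its own direction with its own scale.
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_ctrl * (eps_ctrl - eps_uncond))

eps_uncond = np.zeros(4)
eps_text = np.array([1.0, 0.0, 0.0, 0.0])   # text pulls along +x
eps_ctrl = np.array([-1.0, 0.0, 0.0, 0.0])  # spatial control pulls along -x

out = blend_guidance(eps_uncond, eps_text, eps_ctrl)
print(out)  # conflicting directions cancel: [0. 0. 0. 0.]
```

With equal scales the conflicting directions cancel outright; in practice one signal dominates, which is why prompt/control strength sweeps matter.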
Practical task
Build a small ControlNet demo with Canny, depth, and pose conditioning, and document experiments that sweep prompt strength and control strength.
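For the demo's preprocessing step, here is a dependency-free stand-in for Canny: Sobel gradient magnitude thresholded into a binary control map. A real demo would call `cv2.Canny` on the source image; this sketch just shows what the control image is.

```python
import numpy as np

def sobel_edges(img, thresh=0.5):
    # Stand-in for Canny preprocessing: Sobel gradient magnitude,
    # thresholded into a binary edge map to feed as the control image.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    mag = np.hypot(gx, gy)
    return (mag > thresh * mag.max()).astype(np.uint8)

# Toy image: a bright square on a dark background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
edges = sobel_edges(img)
print(edges.sum() > 0)  # edges fire only on the square's border: True
```

For the strength experiments, log a grid of (guidance scale, controlnet conditioning scale) pairs and compare outputs side by side; both axes interact, so vary them jointly rather than one at a time.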
Source-grounded rule
Treat CLIP/alignment scores as signals, not proof of safety, legality or exact semantic correctness.
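To ground that rule: a CLIP-style alignment score is just cosine similarity between embeddings. The sketch below uses made-up 3-d vectors in place of real CLIP embeddings; a high score means the embeddings are close in that model's space, which is a ranking signal, not proof of safety, legality, or exact semantics.

```python
import numpy as np

def clip_style_score(img_emb, txt_emb):
    # Cosine similarity between (hypothetical) image and text embeddings —
    # the shape of a CLIP alignment score. Treat it as one signal among
    # several, never as a correctness or safety guarantee.
    a = img_emb / np.linalg.norm(img_emb)
    b = txt_emb / np.linalg.norm(txt_emb)
    return float(a @ b)

img_emb = np.array([0.9, 0.1, 0.0])  # made-up embedding, not real CLIP output
txt_emb = np.array([1.0, 0.0, 0.0])
score = clip_style_score(img_emb, txt_emb)
print(round(score, 3))  # high similarity, yet says nothing about safety
```

In a pipeline, gate on such scores only for ranking candidates; keep separate checks for policy and factual constraints.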