Multimodal Conditioning and Control
Text encoders, cross-attention, CLIP/T5-style conditioning, ControlNet spatial controls, adapters and reference conditioning.
What a candidate should be able to do
- Design conditioning APIs for text, image, pose, depth, mask, audio or reference inputs.
- Distinguish global text guidance from spatial controls.
- Explain ControlNet-style side branches without claiming they solve every alignment problem.
- Identify failures in prompt following, spatial relations and counting.
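To make the ControlNet point above concrete, here is a minimal sketch of the side-branch idea: a trainable copy of a base block whose output passes through a zero-initialized projection ("zero conv"), so at initialization the conditioned model is exactly the base model. All names (`base_block`, `controlnet_branch`) and the use of plain linear layers instead of UNet blocks are illustrative assumptions, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_block(x, w):
    # Frozen base-model block: a plain linear layer standing in for a UNet block.
    return x @ w

def controlnet_branch(x, control, w_copy, w_zero):
    # Trainable copy of the base block plus a zero-initialized output
    # projection ("zero conv"): at initialization the branch contributes
    # nothing, so training starts from the unmodified base model.
    h = (x + control) @ w_copy
    return h @ w_zero

d = 8
x = rng.normal(size=(2, d))        # latent features
control = rng.normal(size=(2, d))  # encoded spatial control (e.g. pose/depth map)
w_base = rng.normal(size=(d, d))
w_copy = w_base.copy()             # branch initialized as a copy of base weights
w_zero = np.zeros((d, d))          # zero-initialized projection

out = base_block(x, w_base) + controlnet_branch(x, control, w_copy, w_zero)
# With w_zero all zeros, the sum equals the unconditioned base output.
print(np.allclose(out, base_block(x, w_base)))  # True
```

This also shows why the side branch does not solve every alignment problem: it only adds a residual signal; the base model's text understanding is untouched.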
What interviewers ask
- How would you add pose conditioning without retraining the whole base model?
- Why do text encoders affect typography and prompt following?
- What happens if conditioning signals conflict?
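One way to reason about the "conflicting signals" question is to view each condition as its own guidance direction, blended linearly in the spirit of classifier-free guidance. The function below is a hypothetical sketch, not a real sampler: when two directions oppose each other, raising one scale visibly cancels the other.

```python
import numpy as np

def blend_guidance(eps_uncond, eps_text, eps_ctrl, s_text=1.0, s_ctrl=1.0):
    # Hypothetical linear blend of two guidance directions: each condition
    # pushes the noise prediction along its own direction with its own scale.
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_ctrl * (eps_ctrl - eps_uncond))

eps_uncond = np.zeros(4)
eps_text = np.array([1.0, 0.0, 0.0, 0.0])   # text pulls along +x
eps_ctrl = np.array([-1.0, 0.0, 0.0, 0.0])  # spatial control pulls along -x

out = blend_guidance(eps_uncond, eps_text, eps_ctrl)
print(out)  # conflicting directions cancel: [0. 0. 0. 0.]
```

With equal scales the conflicting directions cancel outright; in practice one signal dominates, which is why prompt/control strength sweeps matter.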
Practical task
Build a small ControlNet demo with Canny, depth, and pose conditioning, and document experiments that sweep prompt strength and control strength.
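For the demo's preprocessing step, here is a dependency-free stand-in for Canny: Sobel gradient magnitude thresholded into a binary control map. A real demo would call `cv2.Canny` on the source image; this sketch just shows what the control image is.

```python
import numpy as np

def sobel_edges(img, thresh=0.5):
    # Stand-in for Canny preprocessing: Sobel gradient magnitude,
    # thresholded into a binary edge map to feed as the control image.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    mag = np.hypot(gx, gy)
    return (mag > thresh * mag.max()).astype(np.uint8)

# Toy image: a bright square on a dark background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
edges = sobel_edges(img)
print(edges.sum() > 0)  # edges fire only on the square's border: True
```

For the strength experiments, log a grid of (guidance scale, controlnet conditioning scale) pairs and compare outputs side by side; both axes interact, so vary them jointly rather than one at a time.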
Source-grounded rule
Treat CLIP/alignment scores as signals, not proof of safety, legality or exact semantic correctness.
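To ground that rule: a CLIP-style alignment score is just cosine similarity between embeddings. The sketch below uses made-up 3-d vectors in place of real CLIP embeddings; a high score means the embeddings are close in that model's space, which is a ranking signal, not proof of safety, legality, or exact semantics.

```python
import numpy as np

def clip_style_score(img_emb, txt_emb):
    # Cosine similarity between (hypothetical) image and text embeddings —
    # the shape of a CLIP alignment score. Treat it as one signal among
    # several, never as a correctness or safety guarantee.
    a = img_emb / np.linalg.norm(img_emb)
    b = txt_emb / np.linalg.norm(txt_emb)
    return float(a @ b)

img_emb = np.array([0.9, 0.1, 0.0])  # made-up embedding, not real CLIP output
txt_emb = np.array([1.0, 0.0, 0.0])
score = clip_style_score(img_emb, txt_emb)
print(round(score, 3))  # high similarity, yet says nothing about safety
```

In a pipeline, gate on such scores only for ranking candidates; keep separate checks for policy and factual constraints.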