Required

Multimodal Conditioning

Text, image, video, pose, depth, segmentation, audio and reference conditioning via cross-attention, adapters and ControlNet-style branches.

Study time: 28 min

Multimodal Conditioning and Control

Text encoders, cross-attention, CLIP/T5-style conditioning, ControlNet spatial controls, adapters and reference conditioning.

What the candidate should be able to do

  • Design conditioning APIs for text, image, pose, depth, mask, audio or reference inputs.
  • Distinguish global text guidance from spatial controls.
  • Explain ControlNet-style side branches without claiming they solve every alignment problem.
  • Identify failures in prompt following, spatial relations and counting.
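The core mechanism behind text guidance in the list above is cross-attention: spatial latents form the queries, text-encoder token embeddings form the keys and values. A minimal single-head sketch in numpy (all projection weights are random placeholders, not a real model's parameters):

```python
import numpy as np

def cross_attention(latents, text_emb, d_k=64, seed=0):
    """Toy single-head cross-attention: image latents attend to text tokens.

    latents:  (n_patches, d_model) -- spatial features (queries)
    text_emb: (n_tokens, d_model)  -- text-encoder outputs (keys/values)
    Projection matrices are random stand-ins for illustration only.
    """
    rng = np.random.default_rng(seed)
    d_model = latents.shape[1]
    W_q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
    W_k = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
    W_v = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))

    Q = latents @ W_q            # (n_patches, d_k)
    K = text_emb @ W_k           # (n_tokens, d_k)
    V = text_emb @ W_v           # (n_tokens, d_k)

    scores = Q @ K.T / np.sqrt(d_k)                   # (n_patches, n_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ V           # (n_patches, d_k): text-informed features

out = cross_attention(np.zeros((16, 32)), np.ones((4, 32)))
```

Because each spatial query attends over *tokens*, not positions, this gives global guidance: it explains why cross-attention alone is weak at precise spatial relations and counting.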

What interviewers ask

  • How would you add pose conditioning without retraining the whole base model?
  • Why do text encoders affect typography and prompt following?
  • What happens if conditioning signals conflict?
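For the conflict question, one concrete framing is classifier-free-guidance-style weighting of multiple signals. The weighting scheme below is an illustrative assumption, not a standard API; it shows how signals pulling in opposite directions partially cancel or overshoot depending on their weights:

```python
import numpy as np

def guided_eps(eps_uncond, eps_text, eps_ctrl, w_text=7.5, w_ctrl=1.5):
    """CFG-style combination of two guidance directions (toy sketch).

    Each conditional prediction contributes its offset from the
    unconditional one, scaled by its guidance weight. Conflicting
    signals subtract; aligned signals add and can overshoot.
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_ctrl * (eps_ctrl - eps_uncond))

eps_u = np.zeros(3)
eps_t = np.full(3, 1.0)    # text pushes one way
eps_c = np.full(3, -1.0)   # control pushes the opposite way
combined = guided_eps(eps_u, eps_t, eps_c)
# each entry is 7.5*1.0 + 1.5*(-1.0) = 6.0: text wins only because w_text > w_ctrl
```

With closer weights the two directions cancel toward the unconditional prediction, which is the practical symptom of conflicting conditioning: washed-out outputs that follow neither signal well.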

Practical task

Build a small ControlNet demo with Canny/depth/pose conditioning and document prompt/control strength experiments.
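Before wiring up a full demo, the key ControlNet idea can be sketched in numpy: a frozen base block, a trainable copy that sees the control signal, and a zero-initialized output projection (the "zero conv") so training starts exactly at the base model's behavior. The `strength` knob below is a hypothetical stand-in for the control-strength parameter the task asks you to sweep:

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, W):
    """Stand-in for one frozen UNet block."""
    return np.tanh(x @ W)

def with_control(x, control, W_base, W_copy, W_zero, strength=1.0):
    """ControlNet-style side branch (toy sketch, not the real architecture).

    The trainable copy sees x + control; the zero-initialized projection
    W_zero gates its residual, so at init the output equals the frozen
    base path. `strength` scales the control residual at inference.
    """
    base_out = block(x, W_base)                     # frozen path
    side_out = block(x + control, W_copy)           # trainable copy
    return base_out + strength * (side_out @ W_zero)

W = rng.normal(size=(8, 8))
x = rng.normal(size=(5, 8))
control = rng.normal(size=(5, 8))

# Zero-initialized gate: the branch is inert at the start of training.
y0 = with_control(x, control, W, W.copy(), np.zeros((8, 8)))
assert np.allclose(y0, block(x, W))

# After training the gate is nonzero; sweep strengths as in the demo.
W_zero = rng.normal(scale=0.1, size=(8, 8))
for s in (0.0, 0.5, 1.0):
    y = with_control(x, control, W, W.copy(), W_zero, strength=s)
```

In the actual demo, `strength` corresponds to whatever control-scale knob your library exposes; documenting outputs across a prompt-strength x control-strength grid is the experiment the task asks for.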

Source-grounded rule

Treat CLIP/alignment scores as signals, not proof of safety, legality or exact semantic correctness.

Materials