Dropout, BatchNorm, and fine-tuning with small batches

Short answer

Dropout is stochastic during training and turned off at inference; the rescaling that keeps expected activations consistent happens either during training or at inference, depending on the framework. BatchNorm uses batch statistics during training and running statistics at inference. Small batches make the batch statistics noisy, so freezing BN or using LayerNorm/GroupNorm can be safer.

Full explanation

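Dropout randomly masks activations during training to reduce co-adaptation. At inference the model should use the full network, with the scaling convention of the framework. With "inverted dropout", surviving activations are scaled by 1/(1 - p) during training and inference is a plain pass-through. With the classic convention, training applies only the mask and activations are scaled by (1 - p) at inference. Both keep the expected activation the same in the two modes.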
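The inverted-dropout convention can be sketched in plain Python. This is a minimal illustration rather than any framework's implementation; the function name `dropout` and its `rng` parameter are hypothetical, chosen only for this example.

```python
import random


def dropout(xs, p, training, rng):
    """Inverted dropout: drop each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        # Inference: plain pass-through, no rescaling needed.
        return list(xs)
    keep = 1.0 - p
    # Each value survives with probability `keep` and is scaled up to compensate.
    return [x / keep if rng.random() < keep else 0.0 for x in xs]


rng = random.Random(0)
xs = [1.0] * 100_000

train_out = dropout(xs, p=0.5, training=True, rng=rng)
eval_out = dropout(xs, p=0.5, training=False, rng=rng)

# The mean activation stays close to 1.0 in both modes.
train_mean = sum(train_out) / len(train_out)
eval_mean = sum(eval_out) / len(eval_out)
```

In training mode every output is either 0.0 or 2.0, yet the average stays near the inference-time value of 1.0, which is the point of the rescaling.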

BatchNorm normalizes activations using the batch mean and variance during training and running estimates during inference. Small batches make the batch estimates noisy, and in distributed training each device computes statistics on its own shard, so per-device statistics can be inconsistent. This is why fine-tuning with tiny batches often freezes BatchNorm, that is, switches it to eval mode so that it uses the pre-trained running statistics, or replaces it with LayerNorm, GroupNorm or InstanceNorm where appropriate, since those layers do not compute statistics across the batch dimension.
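The difference between training-mode and frozen BatchNorm can be shown with a toy single-feature layer. This is a sketch under simplifying assumptions: the class `ToyBatchNorm` is hypothetical, it has no learnable scale or shift, and it updates the running variance with the biased batch variance, which real frameworks may handle differently.

```python
import random
import statistics


class ToyBatchNorm:
    """Toy single-feature BatchNorm without learnable parameters (illustration only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Training: normalize with the statistics of the current batch ...
            mean = statistics.fmean(batch)
            var = statistics.pvariance(batch, mean)
            # ... and fold them into the running estimates.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval (frozen): use the stored running estimates, update nothing.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


rng = random.Random(0)
bn = ToyBatchNorm()

# Stand-in for pre-training: large batches drawn from N(5, 2^2).
for _ in range(200):
    bn([rng.gauss(5.0, 2.0) for _ in range(256)])

x = 6.0
# Training mode, batch size 2: the output for x depends on its neighbour.
train_a = bn([x, 1.0])[0]
train_b = bn([x, 9.0])[0]

# Frozen: the neighbour no longer matters and the running stats stay fixed.
bn.training = False
frozen_stats = (bn.running_mean, bn.running_var)
eval_a = bn([x, 1.0])[0]
eval_b = bn([x, 9.0])[0]
```

In training mode the same input `x` is mapped to a positive value in one batch and a negative value in the other, while the frozen layer returns the same value for both.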

With many GPUs and a large effective batch, SyncBatchNorm can aggregate statistics across devices, but it adds communication cost. Gradient accumulation increases the effective batch size for gradients, but each forward pass still normalizes with the statistics of its own micro-batch, so it does not fix BatchNorm statistics unless the implementation explicitly accounts for it.
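A small numeric sketch of why gradient accumulation does not help BatchNorm: the statistics are computed per forward pass, so the layer sees micro-batch means, not the mean of the accumulated batch. The batch size of 64 and micro-batch size of 2 are arbitrary values picked for illustration.

```python
import random
import statistics

rng = random.Random(0)
full_batch = [rng.gauss(0.0, 1.0) for _ in range(64)]

# Gradient accumulation: 32 forward passes over micro-batches of 2 samples.
micro_batches = [full_batch[i:i + 2] for i in range(0, len(full_batch), 2)]

# The mean a single large-batch forward pass would use.
full_mean = statistics.fmean(full_batch)

# The means BatchNorm actually uses, one per micro-batch forward pass.
micro_means = [statistics.fmean(mb) for mb in micro_batches]

# How much those per-pass means scatter around each other.
micro_mean_spread = statistics.pstdev(micro_means)
```

Averaged over all micro-batches the means agree with `full_mean`, but every individual forward pass is normalized with its own noisy estimate, and that noise is what affects the activations.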

Theory

Normalization layers are part of the optimization and inference contract, not just a harmless architectural detail.

Common mistakes

  • Assuming that gradient accumulation fixes BatchNorm statistics.
  • Forgetting that train and eval modes change BatchNorm and dropout behavior.
  • Using BatchNorm with batch size 1 and expecting stable statistics.

How to answer in an interview

  • Mention freezing BatchNorm during fine-tuning.
  • Separate optimization batch size from normalization statistics.