Mixed-Precision Training, FP16/BF16, and Memory
Answer it yourself
First formulate your answer as you would in an interview, then open the breakdown and grade yourself.
Short answer
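Autocast runs selected ops in lower precision while keeping numerically sensitive ops and the stored parameters or master state in FP32. FP16 has more mantissa bits (10 vs. 7) but a much smaller exponent range (5 vs. 8 exponent bits); BF16 shares FP32's 8-bit exponent and therefore roughly its range. FP16 training usually needs GradScaler so that small gradients do not underflow to zero.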
Full breakdown
Mixed precision means that not every operation and state tensor uses the same dtype. In PyTorch, torch.autocast picks a lower-precision dtype for many forward operations, such as matrix multiplications and convolutions, while keeping numerically sensitive operations in a safer precision. Under autocast the parameters themselves stay in FP32 and are cast on the fly; setups that store FP16 or BF16 weights typically keep FP32 master weights and FP32 optimizer state for the update.
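A minimal sketch of the dtype behavior described above, assuming PyTorch is installed. The toy layer and its sizes are hypothetical, and CPU autocast with BF16 is used only so that the example runs without a GPU.

```python
import torch
from torch import nn

# Hypothetical toy layer; the sizes are arbitrary.
model = nn.Linear(16, 4)
x = torch.randn(8, 16)

# CPU + BF16 is chosen only so this runs without a GPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(model.weight.dtype)  # parameters are still stored in FP32
print(out.dtype)           # the linear layer itself ran in BF16
```

The point of the sketch is that autocast changes the dtype of the computation, not the dtype in which the model is stored.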
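FP16 has 10 mantissa bits to BF16's 7, but only 5 exponent bits, so its largest finite value is 65504 and its smallest positive subnormal is about 6e-8. BF16 keeps FP32's 8 exponent bits, which gives it less precision but roughly FP32's range and often makes training more stable on supported hardware. With FP16, small gradients underflow to zero often enough that PyTorch workflows usually pair autocast with GradScaler: it multiplies the loss by a scale factor before backward, unscales the gradients before the optimizer step, and skips the step when it finds inf or NaN gradients.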
Other memory and speed tools are separate concerns: activation checkpointing trades extra compute for lower activation memory; DDP replicates parameters, gradients and optimizer state on every rank, while FSDP and ZeRO shard some or all of them across ranks; and LoRA changes which parameters are trained.
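To make the checkpointing tradeoff concrete, here is a small sketch using torch.utils.checkpoint. The block and tensor sizes are hypothetical; the example only shows that the checkpointed forward produces the same gradients while recomputing the block's activations during backward instead of storing them.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Hypothetical block; sizes are arbitrary.
block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
x = torch.randn(4, 32, requires_grad=True)

# Plain forward: activations inside `block` are kept for backward.
block(x).sum().backward()
grad_ref = x.grad.clone()

x.grad = None
block.zero_grad()

# Checkpointed forward: activations inside `block` are recomputed in backward.
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(grad_ref, grad_ckpt))
```

This is independent of the dtype choice: checkpointing can be combined with autocast, but it addresses activation memory rather than numerical format.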
Theory
Mixed precision is a numerical and systems optimization, not just converting the whole model to half.
Common mistakes
- Calling model.half() and assuming training is solved.
- Reversing the FP16/BF16 tradeoff: FP16 has the precision, BF16 has the range.
- Forgetting that activations often dominate memory in deep networks.
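The FP16/BF16 tradeoff in the second item is easy to check numerically. The sketch below uses only the Python standard library; the helper names to_fp16 and to_bf16 are made up for this illustration, and BF16 is emulated by truncating the FP32 encoding to its top 16 bits, whereas real hardware normally rounds to nearest.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; model that as overflow to inf.
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Range: a small gradient-like value and a large activation-like value.
print(to_fp16(1e-8))      # 0.0  (underflows in FP16)
print(to_bf16(1e-8) > 0)  # True (still representable in BF16)
print(to_fp16(70000.0))   # inf  (above the FP16 maximum of 65504)
print(to_bf16(70000.0))   # 69632.0 (in range, but coarsely rounded)

# Precision: 1 + 2**-10 needs 10 mantissa bits.
print(to_fp16(1.0009765625))  # 1.0009765625
print(to_bf16(1.0009765625))  # 1.0
```

The first four lines show why FP16 gradients need loss scaling, and the last two show what BF16 gives up in exchange for its range.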
How to answer in an interview
- Mention autocast and GradScaler as the practical PyTorch pair for FP16.
- Separate precision, checkpointing and sharding in the answer.