BatchNorm in training, inference, and multi-GPU
Answer it yourself
First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.
Short answer
BatchNorm normalizes activations with per-channel batch statistics, then applies a learned scale gamma and shift beta. At inference it uses running statistics instead of the current batch. Multi-GPU SyncBN must aggregate enough raw statistics across devices to compute the global mean and variance.
Full walkthrough
For image tensors, BatchNorm usually computes a per-channel mean and variance across the batch and spatial dimensions. It normalizes each value by subtracting the channel mean and dividing by sqrt(variance + eps), then applies the learned scale gamma and shift beta. With C channels there are 2C learned parameters: one gamma and one beta per channel.
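The training-mode computation described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `batchnorm2d_train`, the nested-list layout `x[n][c][h][w]`, and the default `eps` are assumptions made for the example.

```python
import math

def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a nested list x[n][c][h][w].

    gamma and beta are per-channel lists of length C (hypothetical names).
    Returns the normalized tensor and the per-channel (mean, var) pairs.
    """
    num_channels = len(x[0])
    # Copy the input structure so the caller's data is not modified.
    out = [[[list(row) for row in channel] for channel in sample] for sample in x]
    stats = []
    for c in range(num_channels):
        # Pool channel c over the batch and both spatial dimensions.
        values = [v for sample in x for row in sample[c] for v in row]
        mean = sum(values) / len(values)
        # Biased variance (divide by the count), as used for normalization.
        var = sum((v - mean) ** 2 for v in values) / len(values)
        # Divide by the standard deviation, not by the variance.
        inv_std = 1.0 / math.sqrt(var + eps)
        for n, sample in enumerate(x):
            for h, row in enumerate(sample[c]):
                for w, v in enumerate(row):
                    out[n][c][h][w] = gamma[c] * (v - mean) * inv_std + beta[c]
        stats.append((mean, var))
    return out, stats

# Two samples, one channel, a 1x2 spatial grid.
x = [[[[1.0, 2.0]]], [[[3.0, 4.0]]]]
out, stats = batchnorm2d_train(x, gamma=[1.0], beta=[0.0])
```

With gamma = 1 and beta = 0, the output of each channel has mean close to zero and variance close to one. The statistics are shared by all positions of a channel, which is why only one gamma and one beta per channel are needed.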
At inference, the layer should not depend on the current small batch. It uses running mean and running variance collected during training. In multi-GPU training, ordinary BatchNorm may use only local per-device statistics. SyncBatchNorm aggregates across workers so the layer behaves closer to a larger global batch.
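The switch between batch statistics and running statistics can be sketched for a single channel as follows. The class name `ChannelBatchNorm` is hypothetical, and the exponential-moving-average update with `momentum=0.1` is an assumed convention; frameworks differ in details such as whether the running variance uses the unbiased estimate.

```python
import math

class ChannelBatchNorm:
    """Single-channel BatchNorm sketch with a training and an inference mode."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0                # learned parameters
        self.running_mean, self.running_var = 0.0, 1.0  # non-learned statistics
        self.momentum, self.eps = momentum, eps

    def forward(self, values, training):
        if training:
            # Use the current batch and fold it into the running statistics.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            # Assumption: the biased variance is stored here for simplicity.
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: ignore the current batch statistics entirely.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in values]

bn = ChannelBatchNorm()
bn.forward([1.0, 3.0], training=True)        # updates the running statistics
eval_out = bn.forward([5.0], training=False)  # works even for a batch of one
```

In inference mode the output for a given input does not depend on what else is in the batch, so a batch of size one is handled without any special case.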
For variance, simply averaging local standard deviations (or local variances) is wrong, because it ignores the differences between per-device means. Implementations communicate statistics such as sums, squared sums, and counts, or equivalent sufficient statistics, then derive the global mean and variance from the totals.
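A small sketch of that aggregation, using the sum, squared sum, and count per device. The helper names `local_stats` and `global_mean_var` are hypothetical, and the list of tuples stands in for whatever all-reduce a real implementation would perform.

```python
def local_stats(values):
    """Sufficient statistics one device would send: (sum, sum of squares, count)."""
    return sum(values), sum(v * v for v in values), len(values)

def global_mean_var(per_device):
    """Combine per-device (sum, sumsq, count) tuples into global mean and variance."""
    total = sum(s for s, _, _ in per_device)
    total_sq = sum(q for _, q, _ in per_device)
    count = sum(n for _, _, n in per_device)
    mean = total / count
    # E[x^2] - E[x]^2; clamped at zero because rounding can make it slightly negative.
    var = max(total_sq / count - mean * mean, 0.0)
    return mean, var

def biased_var(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

device_a = [1.0, 2.0]
device_b = [5.0, 6.0]

mean, var = global_mean_var([local_stats(device_a), local_stats(device_b)])
naive_var = (biased_var(device_a) + biased_var(device_b)) / 2

print(mean, var)   # global statistics over all four values
print(naive_var)   # average of local variances, which is much smaller here
```

In this example the two devices hold values with very different means, so the averaged local variance (0.25) badly underestimates the global variance (4.25). The sums-based route recovers the same result as computing the variance over the concatenated batch.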
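Theory
BatchNorm has learned affine parameters (gamma and beta, updated by gradients) and non-learned statistics: the batch mean and variance used during training, and the running mean and variance used at inference.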
Common mistakes
- Dividing by the variance instead of the standard deviation sqrt(variance + eps).
- Assuming that averaging local variances is always enough.
- Forgetting that inference uses running statistics.
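Why averaging local variances falls short can be written out as a short identity. The notation is assumed for this note: device k holds n_k values with local mean \mu_k and biased local variance \sigma_k^2, and N is the total count.

```latex
\[
N = \sum_k n_k, \qquad
\mu = \frac{1}{N} \sum_k n_k \mu_k,
\]
\[
\sigma^2
  = \underbrace{\frac{1}{N} \sum_k n_k \sigma_k^2}_{\text{average of local variances}}
  + \underbrace{\frac{1}{N} \sum_k n_k (\mu_k - \mu)^2}_{\text{spread of local means}} .
\]
```

The second term vanishes only when every device has the same local mean, so the averaged local variance is in general a lower bound on the global variance, not the value itself.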
How to answer in an interview
- Mention gamma and beta explicitly.
- For SyncBN, say what statistics must be globally aggregated.