Intuition for Adam, momentum, and RMSProp
Answer it yourself
First formulate your answer as you would in an interview, then open the breakdown and grade yourself.
Short answer
SGD updates weights by minibatch gradients. Momentum smooths direction with a running average of gradients. RMSProp normalizes by a running average of squared gradients. Adam combines both with bias-corrected first and second moments.
Full breakdown
Plain minibatch SGD computes a gradient on a batch and moves the parameters against that gradient, scaled by a learning rate. The batch-to-batch noise can help it escape shallow minima and saddle regions, but when the loss is ill-conditioned it zigzags across steep directions and crawls along flat ones.
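To make the update rule concrete, here is a minimal NumPy sketch of minibatch SGD on a small least-squares problem. The function names, the synthetic data, and the learning rate are illustrative assumptions, not part of the original answer. The second feature is scaled by 10 so that the problem is ill-conditioned.

```python
import numpy as np

def sgd_step(params, grad, lr):
    """One plain SGD update: move against the gradient, scaled by lr."""
    return params - lr * grad

def minibatch_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) computed on one batch."""
    residual = X @ w - y
    return X.T @ residual / len(y)

rng = np.random.default_rng(0)
# Hypothetical data: the second feature has 10x the scale of the first,
# so curvature differs by roughly 100x between the two directions.
X = rng.normal(size=(256, 2)) * np.array([1.0, 10.0])
w_true = np.array([2.0, -3.0])
y = X @ w_true

w = np.zeros(2)
for step in range(500):
    idx = rng.choice(len(y), size=32, replace=False)  # sample a minibatch
    w = sgd_step(w, minibatch_grad(w, X[idx], y[idx]), lr=0.005)
```

With a single learning rate, the steep coordinate converges within a few steps while the flat coordinate is still catching up after hundreds of steps. That mismatch is what momentum and adaptive scaling try to address.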
Momentum keeps an exponential moving average of gradients. This first-moment estimate behaves like velocity: directions that persist accumulate speed, while noisy sign changes are smoothed.
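The velocity picture can be checked with a small sketch. It assumes the exponential-moving-average form of momentum described above (classical heavy-ball momentum omits the `1 - beta` factor and differs only by a rescaling of the learning rate). The name `momentum_step` and the toy gradients are hypothetical.

```python
import numpy as np

def momentum_step(params, velocity, grad, lr=0.1, beta=0.9):
    """EMA-style momentum: velocity is a running average of gradients."""
    velocity = beta * velocity + (1.0 - beta) * grad
    params = params - lr * velocity
    return params, velocity

params = np.zeros(2)
velocity = np.zeros(2)
for t in range(100):
    # Coordinate 0 sees a persistent gradient of +1.
    # Coordinate 1 sees a gradient that flips sign every step.
    grad = np.array([1.0, 1.0 if t % 2 == 0 else -1.0])
    params, velocity = momentum_step(params, velocity, grad)

# velocity[0] approaches 1 (the persistent direction builds up speed),
# while velocity[1] stays small because the sign flips largely cancel.
```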
RMSProp keeps an exponential moving average of squared gradients. This second-moment estimate rescales updates coordinate-wise, shrinking steps where gradients are consistently large or volatile. Adam combines momentum and RMSProp: the update roughly follows first_moment / (sqrt(second_moment) + epsilon), with both moments bias-corrected to offset their zero initialization in the early steps.
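A minimal sketch of both update rules, assuming the commonly used defaults (`rho=0.9`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) and with epsilon added outside the square root; some implementations place it inside. The function names are illustrative and do not come from any particular library.

```python
import numpy as np

def rmsprop_step(params, sq_avg, grad, lr=1e-3, rho=0.9, eps=1e-8):
    """RMSProp: divide the gradient by a running RMS of past gradients."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad**2
    params = params - lr * grad / (np.sqrt(sq_avg) + eps)
    return params, sq_avg

def adam_step(params, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum (m) divided by RMS scale (v), both bias-corrected.

    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad      # first moment
    v = beta2 * v + (1.0 - beta2) * grad**2   # second moment
    m_hat = m / (1.0 - beta1**t)              # undo the bias toward zero
    v_hat = v / (1.0 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Two coordinates whose gradients differ by four orders of magnitude.
grad = np.array([100.0, 0.01])
p_adam, m, v = adam_step(np.zeros(2), np.zeros(2), np.zeros(2), grad, t=1)
p_rms, sq = rmsprop_step(np.zeros(2), np.zeros(2), grad)
```

On the first step both coordinates move by nearly the same amount under either rule, even though the raw gradients differ by a factor of 10,000. With bias correction, Adam's first step has magnitude close to `lr`. RMSProp without correction takes a larger first step because its running average starts at zero.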
Theory
Adaptive optimizers change the effective learning rate per parameter based on gradient history.
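One way to see the per-parameter effect is to compute the implied step size directly. The sketch below tracks only the second moment for two coordinates with different gradient scales and reports `lr / (sqrt(v_hat) + eps)`. The gradient scales and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lr, beta2, eps = 1e-3, 0.999, 1e-8
grad_scale = np.array([10.0, 0.1])  # hypothetical: coord 0 has 100x larger gradients

v = np.zeros(2)
steps = 1000
for t in range(1, steps + 1):
    grad = grad_scale * rng.normal(size=2)    # noisy gradients at two scales
    v = beta2 * v + (1.0 - beta2) * grad**2   # running average of squared gradients

v_hat = v / (1.0 - beta2**steps)              # bias-corrected second moment
effective_lr = lr / (np.sqrt(v_hat) + eps)    # per-coordinate step multiplier
```

The coordinate with large gradients ends up with an effective learning rate roughly 100 times smaller than the coordinate with small gradients, although both share the same nominal `lr`.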
Common mistakes
- Saying that Adam is just SGD with a different learning rate.
- Mixing up the first moment (mean of gradients) and the second moment (mean of squared gradients).
- Forgetting the roles of epsilon and the square root in the normalization.
- Assuming that Adam always generalizes better than SGD.
How to answer in an interview
- Describe first moment as velocity and second moment as scale normalization.
- Name AdamW if asked about decoupled weight decay.
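If the decoupled weight decay question comes up, a short sketch helps. It assumes the usual formulation in which the decay shrinks the weights directly instead of being added to the gradient. `adamw_step` and its defaults are illustrative, not a library API.

```python
import numpy as np

def adamw_step(params, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """Adam step with decoupled weight decay (AdamW-style sketch)."""
    # The decay is applied to the weights themselves, so it is NOT
    # rescaled by the second-moment normalization below.
    params = params * (1.0 - lr * weight_decay)
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# With a zero loss gradient, only the decay acts on the weights.
p0 = np.array([1.0, -2.0])
p1, m, v = adamw_step(p0, np.zeros(2), np.zeros(2), np.zeros(2), t=1)
```

With L2 regularization folded into the gradient, the penalty would pass through the same per-coordinate normalization as the loss gradient, so parameters with large gradient history would be decayed less. Decoupling keeps the shrinkage proportional to the weight itself.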