Назад к подготовке

Интуиция Adam, momentum и RMSProp

Интуиция Adam, momentum и RMSProp

Ответить самому

Сначала сформулируйте ответ как на собеседовании, затем откройте разбор и оцените себя.

Загрузка

Короткий ответ

SGD updates weights by minibatch gradients. Momentum smooths direction with a running average of gradients. RMSProp normalizes by a running average of squared gradients. Adam combines both with bias-corrected first and second moments.

Полный разбор

Plain minibatch SGD computes a gradient on a batch and moves parameters opposite that gradient with a learning rate. Its noise can help escape shallow local structure, but it can zigzag or move slowly in ill-conditioned directions.

Momentum keeps an exponential moving average of gradients. This first-moment estimate behaves like velocity: directions that persist accumulate speed, while noisy sign changes are smoothed.

RMSProp keeps an exponential moving average of squared gradients. This second-moment estimate rescales updates coordinate-wise, reducing steps where gradients are consistently large or volatile. Adam combines momentum and RMSProp: update roughly follows first_moment / sqrt(second_moment + epsilon), usually with bias correction in early steps.

Теория

Adaptive optimizers change the effective learning rate per parameter based on gradient history.

Типичные ошибки

  • Say Adam is just SGD with a different learning rate.
  • Mix up first moment and second moment.
  • Forget the epsilon and square root role in normalization.
  • Assume Adam always generalizes better than SGD.

Как отвечать на собеседовании

  • Describe first moment as velocity and second moment as scale normalization.
  • Name AdamW if asked about decoupled weight decay.