Question · Medium · deep-learning · Technical interview · Tochka

Intuition behind Adam, momentum, and RMSProp

Short answer

SGD steps the weights against the minibatch gradient. Momentum smooths the step direction with a running average of gradients. RMSProp normalizes the step size by a running average of squared gradients. Adam combines both, using bias-corrected first and second moments.

Full breakdown

Plain minibatch SGD computes a gradient on a batch and moves the parameters opposite that gradient, scaled by a learning rate. Its noise can help escape shallow local minima and saddle points, but it can zigzag or move slowly in ill-conditioned directions.

Momentum keeps an exponential moving average of gradients. This first-moment estimate behaves like velocity: directions that persist accumulate speed, while noisy sign changes are smoothed out.

RMSProp keeps an exponential moving average of squared gradients. This second-moment estimate rescales updates coordinate-wise, shrinking steps where gradients are consistently large or volatile and enlarging them where gradients are small.

Adam combines momentum and RMSProp: the update roughly follows first_moment / (sqrt(second_moment) + epsilon). Both averages start at zero and are therefore biased toward zero in early steps, so Adam divides each by (1 - beta^t) to correct for that.
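The four update rules described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the default hyperparameters, and the toy quadratic are assumptions made for the example, and epsilon is added to the square root, as in the common formulation of Adam.

```python
import numpy as np

# Hypothetical helper names; default hyperparameters are illustrative assumptions.

def sgd_step(w, g, lr=0.1):
    # Move opposite the gradient, scaled by the learning rate.
    return w - lr * g

def momentum_step(w, g, m, lr=0.1, beta=0.9):
    # First moment: exponential moving average of gradients ("velocity").
    # Some formulations use m = beta * m + g instead; the EMA form is used here.
    m = beta * m + (1 - beta) * g
    return w - lr * m, m

def rmsprop_step(w, g, v, lr=0.01, beta=0.9, eps=1e-8):
    # Second moment: exponential moving average of squared gradients.
    v = beta * v + (1 - beta) * g**2
    # Per-coordinate rescaling; eps guards against division by zero.
    return w - lr * g / (np.sqrt(v) + eps), v

def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias correction: m and v start at zero, so early estimates are too small.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def grad(w):
    # Gradient of a toy ill-conditioned quadratic 0.5 * (100 * w[0]**2 + w[1]**2).
    return np.array([100.0, 1.0]) * w

w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 201):  # t starts at 1 so the bias correction is well defined
    w, m, v = adam_step(w, grad(w), m, v, t, lr=0.05)
print(w)
```

One detail worth noticing: on the very first Adam step the bias-corrected moments equal g and g**2, so every coordinate moves by about lr in the direction of -sign(g), regardless of how large its gradient is.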

Theory

Adaptive optimizers change the effective learning rate per parameter based on gradient history.
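To make the "effective learning rate per parameter" concrete, the sketch below computes the multiplier lr / (sqrt(v_hat) + eps) that an Adam-style second moment would apply to each coordinate. The function name `effective_lr` and the synthetic gradient history are assumptions made for illustration only.

```python
import numpy as np

def effective_lr(grad_history, lr=0.001, beta2=0.999, eps=1e-8):
    """Per-parameter step multiplier lr / (sqrt(v_hat) + eps) after the given history."""
    v = np.zeros(grad_history.shape[1])
    for t, g in enumerate(grad_history, start=1):
        # Running average of squared gradients, one entry per parameter.
        v = beta2 * v + (1 - beta2) * g**2
    v_hat = v / (1 - beta2**t)  # bias correction for the zero initialization
    return lr / (np.sqrt(v_hat) + eps)

# Hypothetical history: parameter 0 always sees gradient 100, parameter 1 sees 0.01.
history = np.tile(np.array([100.0, 0.01]), (50, 1))
print(effective_lr(history))  # approximately [1e-5, 0.1]
```

The parameter with large gradients gets a multiplier about four orders of magnitude smaller, so both coordinates end up taking steps of similar size (about lr each).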

Common mistakes

Saying that Adam is just SGD with a different learning rate.
Mixing up the first moment and the second moment.
Forgetting the role of epsilon and the square root in the normalization.
Assuming that Adam always generalizes better than SGD.

How to answer in an interview

Describe first moment as velocity and second moment as scale normalization.
Name AdamW if asked about decoupled weight decay.
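
If the conversation does turn to AdamW, the difference can be shown in a few lines. The sketch below contrasts Adam with an L2 penalty folded into the gradient against a decoupled decay applied directly to the weights. The function names and hyperparameter values are assumptions for illustration, not any library's API.

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    # L2 penalty is added to the gradient, so it is rescaled by the second moment too.
    g = g + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr=1e-3, wd=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moments are built from the loss gradient only.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Decoupled weight decay: shrink weights directly, outside the normalization.
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w, m, v

w = np.array([1.0, 1.0])
zero = np.zeros_like(w)
# With a zero loss gradient, only the decay term moves the weights.
w_l2, _, _ = adam_l2_step(w, zero, zero, zero, t=1)
w_dec, _, _ = adamw_step(w, zero, zero, zero, t=1)
print(w_l2, w_dec)
```

With a zero loss gradient, the decoupled version shrinks each weight by the factor (1 - lr * wd), while the L2 version takes a normalized step of about lr, largely independent of the value of wd.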