Distributed Training Foundations

The foundations of distributed training: processes, rank and world size, process groups, collectives, DDP gradient synchronization, and NCCL as the GPU communication backend.

What the candidate should be able to do

Explain the lifecycle of a single DDP step: forward, backward, gradient buckets, all-reduce, optimizer step.
Distinguish DataParallel from DistributedDataParallel and explain why DDP is the default baseline for serious multi-GPU training.
Understand the collectives: all-reduce, reduce-scatter, all-gather, broadcast.
Diagnose the typical causes of hangs: rank mismatch, ranks issuing different collectives, dataloader imbalance, NCCL or network issues.
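
To make the collectives named above concrete, here is a minimal pure-Python model of their semantics. It is only a sketch: no GPUs, NCCL, or torch.distributed are involved, each inner list stands in for the buffer held by one rank, and the function names are hypothetical helpers chosen for illustration. It also shows the standard identity that an all-reduce gives the same result as a reduce-scatter followed by an all-gather.

```python
# Toy model of collective semantics. Each inner list is one rank's buffer.
# These helpers are illustrative only; they are not the torch.distributed API.

def all_reduce(buffers):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]


def reduce_scatter(buffers):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world_size = len(buffers)
    total = [sum(column) for column in zip(*buffers)]
    chunk = len(total) // world_size  # assumes length divisible by world size
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather(chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    gathered = [value for chunk in chunks for value in chunk]
    return [list(gathered) for _ in chunks]


def broadcast(buffers, src=0):
    """Every rank ends up with a copy of the source rank's buffer."""
    return [list(buffers[src]) for _ in buffers]


# Two ranks, each holding a local "gradient" of four elements.
grads = [[1, 2, 3, 4], [10, 20, 30, 40]]

print(all_reduce(grads))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
print(all_gather(reduce_scatter(grads)) == all_reduce(grads))  # True
```

DDP's gradient synchronization corresponds to the all-reduce case, with the summed gradients additionally divided by the world size so that every rank applies the same averaged gradient.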

What interviewers ask

Why is DDP usually preferable to DataParallel?
What happens to the gradients during the backward pass?
Why can one rank hang while the others keep computing?
How is effective batch size related to world size and gradient accumulation?
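
For the last question, the usual relationship in data-parallel training is that each optimizer step consumes the per-device batch, multiplied by the number of ranks, multiplied by the number of accumulation steps. A minimal sketch, with a hypothetical helper name and illustrative numbers:

```python
def effective_batch_size(per_device_batch, world_size, grad_accum_steps=1):
    """Samples contributing to one optimizer step across all ranks."""
    return per_device_batch * world_size * grad_accum_steps


# Illustrative values: 32 samples per GPU, 8 GPUs, 4 accumulation steps.
print(effective_batch_size(32, 8, 4))  # 1024
```

The same effective batch can be reached with fewer GPUs and more accumulation steps, at the cost of more wall-clock time per optimizer step.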

Practical task

Convert a single-GPU training loop to torchrun + DDP, and add logging of rank and world size, effective batch size, throughput, and peak GPU memory.
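
One possible shape of a solution is sketched below. It is a starting point under stated assumptions rather than a reference implementation: the model, the synthetic dataset, the batch size, and the helper name `read_dist_env` are placeholders invented for illustration, and the script assumes one CUDA GPU per process with the NCCL backend available. The guard at the bottom keeps the script from starting training unless it was launched by torchrun, which sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`.

```python
import os
import time


def read_dist_env():
    """Read the per-process variables that torchrun sets."""
    return (int(os.environ["RANK"]),
            int(os.environ["LOCAL_RANK"]),
            int(os.environ["WORLD_SIZE"]))


def main():
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    rank, local_rank, world_size = read_dist_env()
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", local_rank)

    per_device_batch, accum_steps = 32, 1  # placeholder hyperparameters
    # Synthetic data stands in for the real dataset.
    dataset = TensorDataset(torch.randn(4096, 128),
                            torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)  # gives each rank its own shard
    loader = DataLoader(dataset, batch_size=per_device_batch, sampler=sampler)

    model = DDP(torch.nn.Linear(128, 10).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    effective_batch = per_device_batch * world_size * accum_steps
    print(f"rank={rank} world_size={world_size} "
          f"effective_batch={effective_batch}")

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        torch.cuda.reset_peak_memory_stats(device)
        start, seen = time.perf_counter(), 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradient buckets here
            optimizer.step()
            seen += x.size(0)
        torch.cuda.synchronize(device)  # finish GPU work before timing
        elapsed = time.perf_counter() - start
        peak_mib = torch.cuda.max_memory_allocated(device) / 2**20
        print(f"rank={rank} epoch={epoch} "
              f"samples_per_sec={seen / elapsed:.1f} (this rank) "
              f"peak_mem_mib={peak_mib:.1f}")

    dist.destroy_process_group()


# Run only under torchrun, which sets RANK for every worker process.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

Assuming the file is saved as `train_ddp.py` (a hypothetical name), it could be launched on a single node with `torchrun --nproc_per_node=4 train_ddp.py`. The throughput printed is per rank; multiplying by the world size gives a rough global figure.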

Source-grounded rule

Verify the main claims against the PyTorch DDP API docs and design notes and the NVIDIA NCCL docs; do not promise linear scaling without caveats about model size, interconnect, and batch size.
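
One way to keep scaling claims honest is to report measured scaling efficiency instead of assuming it. The sketch below uses a hypothetical helper and made-up throughput numbers purely for illustration; real values depend on the model size, interconnect, and batch size mentioned above.

```python
def scaling_efficiency(throughput_1gpu, throughput_ngpu, n_gpus):
    """Fraction of ideal linear speedup achieved (1.0 means linear)."""
    return throughput_ngpu / (n_gpus * throughput_1gpu)


# Made-up measurements: 1000 samples/s on 1 GPU, 6800 samples/s on 8 GPUs.
efficiency = scaling_efficiency(1000.0, 6800.0, 8)
print(f"{efficiency:.0%}")  # 85%
```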

Materials

Additional reading

PyTorch DistributedDataParallel Docs

Primary API source for DDP behavior and constraints.

PyTorch DDP Design Note

Explains reducer behavior and synchronization model.

NVIDIA NCCL User Guide

Primary source for GPU collective communication primitives.