FSDP, DeepSpeed ZeRO and Sharding

Sharding model states instead of fully replicating them: optimizer states, gradients, and parameters in PyTorch FSDP and DeepSpeed ZeRO.

Compare DDP, ZeRO-1/2/3, and FSDP by what each one replicates and what it shards.
Understand the memory/communication trade-off: lower per-GPU memory often means more communication and more orchestration complexity.
Choose FSDP/ZeRO when model states or activations do not fit in GPU memory.
Understand the risks: wrapping policy, offload, checkpoint format, and version compatibility.
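The comparison of what is replicated versus sharded can be made concrete with a small calculator. The sketch below follows the model-state accounting from the ZeRO paper (Rajbhandari et al.) for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for fp32 optimizer state. The function name, the strategy labels, and the choice of 1 GB = 1e9 bytes are assumptions made for this illustration; activations, buffers, and fragmentation are not counted, and the mapping of FSDP strategies to ZeRO stages in the comments is approximate.

```python
# Per-GPU model-state memory, following the ZeRO paper's accounting
# for mixed-precision Adam. Activations and buffers are NOT included.

BYTES_PARAMS = 2    # fp16 parameters
BYTES_GRADS = 2     # fp16 gradients
K_OPTIMIZER = 12    # fp32 master weights + Adam momentum + Adam variance

# Per-step communication volume in units of the parameter count,
# as analyzed in the ZeRO paper (all-reduce = reduce-scatter + all-gather).
COMM_VOLUME_FACTOR = {"ddp": 2, "zero1": 2, "zero2": 2, "zero3": 3}


def model_state_bytes_per_gpu(num_params, num_gpus, strategy):
    """Return the bytes of model states held by each GPU (hypothetical helper)."""
    psi, n = num_params, num_gpus
    if strategy == "ddp":      # everything replicated (FSDP NO_SHARD is similar)
        return (BYTES_PARAMS + BYTES_GRADS + K_OPTIMIZER) * psi
    if strategy == "zero1":    # optimizer states sharded
        return (BYTES_PARAMS + BYTES_GRADS) * psi + K_OPTIMIZER * psi / n
    if strategy == "zero2":    # + gradients sharded (roughly FSDP SHARD_GRAD_OP)
        return BYTES_PARAMS * psi + (BYTES_GRADS + K_OPTIMIZER) * psi / n
    if strategy == "zero3":    # + parameters sharded (roughly FSDP FULL_SHARD)
        return (BYTES_PARAMS + BYTES_GRADS + K_OPTIMIZER) * psi / n
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    # The 7.5B-parameter, 64-GPU example used in the ZeRO paper.
    for name in ("ddp", "zero1", "zero2", "zero3"):
        gb = model_state_bytes_per_gpu(7_500_000_000, 64, name) / 1e9
        print(f"{name:6s} {gb:6.1f} GB  comm x{COMM_VOLUME_FACTOR[name]}")
```

For 7.5B parameters on 64 GPUs this gives 120, 31.4, 16.6, and 1.9 GB, matching the figures in the ZeRO paper. The communication factors show the other side of the trade-off: only full parameter sharding raises the per-step volume, from 2 to 3 times the parameter count.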

Practical task

On a toy Transformer, compare DDP against FSDP or DeepSpeed ZeRO-2/3 by peak memory, step time, checkpoint size, and resume behavior.
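For the DeepSpeed arms of this comparison, one possible starting point is to generate one config per ZeRO stage so that the runs differ only in the sharding setting. The sketch below builds such configs as plain dictionaries. The key names follow the DeepSpeed configuration JSON; the helper name zero_config, the micro-batch size of 4, and the choice of bf16 are assumptions for illustration, not recommended values.

```python
import json


def zero_config(stage, micro_batch_size=4, offload_optimizer=False,
                offload_params=False):
    """Build a minimal DeepSpeed config dict for one ZeRO stage (hypothetical helper)."""
    if stage not in (1, 2, 3):
        raise ValueError("ZeRO stage must be 1, 2, or 3")
    if offload_params and stage != 3:
        # Parameter offload only makes sense once parameters are partitioned.
        raise ValueError("parameter offload requires ZeRO stage 3")

    zero = {"stage": stage}
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": "cpu"}
    if offload_params:
        zero["offload_param"] = {"device": "cpu"}

    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,  # assumed value
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},                           # assumed precision
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # One config per arm of the experiment; everything except the stage is fixed.
    for stage in (2, 3):
        print(json.dumps(zero_config(stage), indent=2))
```

Each dictionary can be written to a JSON file or passed as the config argument of deepspeed.initialize. Keeping batch size and precision identical across runs helps attribute differences in peak memory and step time to the sharding stage rather than to other settings.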

Source-grounded rule

Numerical claims about memory savings or scaling must be tied to specific FSDP/ZeRO docs, papers, or engineering reports, not stated as general rules.

Materials

Primary source for FSDP sharding strategies and state dict handling.

Primary source for ZeRO partitioning behavior.

Engineering motivation for ZeRO and large-model memory optimization.