Training Stability and Checkpointing

The operational layer of long distributed training runs: NaNs, loss spikes, sharded checkpoints, resume semantics, RNG and gradient-scaler state, observability, and recovery.

Distinguish activation checkpointing (recomputing activations to save memory) from training checkpointing (saving run state for recovery).
Understand what must be saved: model, optimizer, scheduler, gradient scaler, RNG state, dataloader position/progress, and framework config.
Build a recovery checklist for FSDP/ZeRO or multi-node failures.
Diagnose divergence after resume without assuming a single universal cause.
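The list of state to save can be made concrete with a small sketch. It is framework-free and uses only the Python standard library; the field names, the `build_checkpoint`, `save_atomic`, and `load_checkpoint` helpers, and the example values are all hypothetical. It does not show how any particular FSDP or ZeRO implementation lays out its shards.

```python
import os
import pickle
import random
import tempfile


def build_checkpoint(state, rng, config):
    """Bundle everything a resume needs, not only the weights (hypothetical layout)."""
    return {
        "model": list(state["params"]),                        # weights
        "optimizer": list(state["momentum"]),                  # optimizer buffers
        "scheduler": {"step": state["step"]},                  # position in the LR schedule
        "scaler": {"loss_scale": state["loss_scale"]},         # mixed-precision loss scale
        "rng": rng.getstate(),                                 # random generator state
        "progress": {"samples_seen": state["samples_seen"]},   # dataloader position
        "config": dict(config),                                # run/framework configuration
    }


def save_atomic(checkpoint, path):
    """Write to a temp file, then rename, so a crash mid-write
    does not leave a truncated file at the final path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(checkpoint, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Illustrative values only.
rng = random.Random(0)
state = {"params": [0.5, -1.0], "momentum": [0.0, 0.1], "step": 10,
         "loss_scale": 1024.0, "samples_seen": 320}
config = {"world_size": 8, "sharding": "full"}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "step_10.ckpt")
    save_atomic(build_checkpoint(state, rng, config), path)
    restored = load_checkpoint(path)
```

A checkpoint that omits any of these entries may still load without error, which is one reason divergence after resume can have several different causes.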

Practical task

Implement a kill-and-resume test for sharded training and compare the loss curve before and after resume.
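The shape of such a test can be sketched on a toy problem before moving to a real sharded run. The example below is a single-process stand-in, assuming noisy SGD with momentum on f(w) = w^2; the `train` function, the state layout, and the constants are hypothetical. It compares an uninterrupted run with a run that is stopped and resumed from a checkpoint, and adds a negative control where the RNG state is not restored.

```python
import copy
import random


def train(num_steps, state=None, seed=0):
    """Toy noisy SGD with momentum on f(w) = w^2.

    Runs from state["step"] up to num_steps and returns (losses, state).
    """
    if state is None:
        state = {"w": 5.0, "velocity": 0.0, "step": 0,
                 "rng": random.Random(seed).getstate()}
    state = copy.deepcopy(state)
    rng = random.Random()
    rng.setstate(state["rng"])
    losses = []
    while state["step"] < num_steps:
        noise = rng.gauss(0.0, 0.1)               # stands in for data sampling or dropout
        grad = 2.0 * state["w"] + noise
        lr = 0.1 / (1.0 + 0.01 * state["step"])   # schedule depends on the step counter
        state["velocity"] = 0.9 * state["velocity"] + grad
        state["w"] -= lr * state["velocity"]
        state["step"] += 1
        losses.append(state["w"] ** 2)
    state["rng"] = rng.getstate()
    return losses, state


TOTAL, KILL_AT = 50, 20

# Reference: one uninterrupted run.
baseline, _ = train(TOTAL)

# Interrupted run: train, checkpoint, "kill", then resume from the checkpoint only.
first_part, checkpoint = train(KILL_AT)
second_part, _ = train(TOTAL, state=checkpoint)
resumed = first_part + second_part
max_gap = max(abs(a - b) for a, b in zip(baseline, resumed))

# Negative control: resume with a fresh RNG state instead of the saved one.
broken = dict(checkpoint, rng=random.Random(123).getstate())
bad_part, _ = train(TOTAL, state=broken)
bad_gap = max(abs(a - b) for a, b in zip(baseline[KILL_AT:], bad_part))

print(f"gap with full state: {max_gap}")
print(f"gap without saved RNG: {bad_gap}")
```

In this deterministic toy the resumed curve matches the baseline exactly. In a real sharded run, nondeterministic kernels or a different reduction order may make bitwise equality unrealistic, so comparing within a tolerance is likely the more practical check.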

Source-grounded rule

Checkpoint portability and resume semantics are version- and config-specific; avoid claiming universal compatibility.
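One way to act on this rule is to store run metadata next to the checkpoint and compare it explicitly at load time, rather than assuming the checkpoint is compatible. The sketch below is a generic illustration: `compatibility_report`, the metadata keys, the version strings, and the choice of which keys are strict are all hypothetical. Whether a given mismatch is actually fatal depends on the framework and version in use.

```python
def compatibility_report(saved_meta, current_meta, strict_keys):
    """List mismatches between the metadata saved with a checkpoint
    and the metadata of the run that is trying to load it."""
    problems = []
    for key in sorted(set(saved_meta) | set(current_meta)):
        saved = saved_meta.get(key, "<missing>")
        current = current_meta.get(key, "<missing>")
        if saved != current:
            # The caller decides which keys are blocking; this is policy, not a fact
            # about any specific framework.
            level = "ERROR" if key in strict_keys else "WARN"
            problems.append(f"{level}: {key}: saved={saved!r} current={current!r}")
    return problems


# Illustrative metadata only.
saved_meta = {"framework_version": "2.3.0", "precision": "bf16",
              "sharding_strategy": "full_shard", "world_size": 16}
current_meta = {"framework_version": "2.4.0", "precision": "bf16",
                "sharding_strategy": "full_shard", "world_size": 8}

report = compatibility_report(saved_meta, current_meta,
                              strict_keys={"sharding_strategy", "world_size"})
for line in report:
    print(line)
```

Treating the strict set as a per-project decision keeps the check honest: some setups can reshard across world sizes and others cannot, and the documentation for the installed version is the place to confirm which applies.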

Materials

Primary source for distributed checkpoint APIs.

DeepSpeed-specific checkpointing and recomputation behavior.

2024 engineering report with operational reliability context.