Streaming DataLoaders and Storage — Advanced ML Engineering

Streaming DataLoaders and Storage Formats

Topics covered: WebDataset, object storage, sharded tar files, Hugging Face (HF) streaming datasets, DALI/NVDEC, shuffle constraints, and data-loader bottleneck debugging.

Explain why sharding and sequential reads reduce small-file overhead for large media datasets.
Use streaming/iterable datasets when full materialization is impractical.
Understand the trade-offs among random access, shuffle quality, resumability, and throughput.
Profile whether training is GPU-bound, CPU preprocessing-bound, storage-bound or network-bound.
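
As a first step toward the profiling objective, one rough check is to split wall-clock time into "waiting for the next batch" and "running the training step". The sketch below is a minimal illustration using only the standard library; `profile_loader`, `slow_batches`, and `fast_step` are hypothetical names, and the sleep call merely stands in for a slow input pipeline.

```python
import time
from typing import Any, Callable, Dict, Iterable


def profile_loader(batches: Iterable[Any],
                   train_step: Callable[[Any], None],
                   max_steps: int = 50) -> Dict[str, float]:
    """Split wall-clock time into data-wait time and step time."""
    wait_s = 0.0
    step_s = 0.0
    steps = 0
    it = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent here is input-pipeline wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        # Assumption: train_step blocks until the work is finished. For GPU
        # work this means synchronizing the device before it returns.
        train_step(batch)
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        steps += 1
    total = wait_s + step_s
    return {
        "steps": steps,
        "wait_s": wait_s,
        "step_s": step_s,
        "wait_fraction": wait_s / total if total > 0 else 0.0,
    }


def slow_batches(n: int):
    """Hypothetical loader: the sleep stands in for slow storage or decode."""
    for i in range(n):
        time.sleep(0.01)
        yield i


def fast_step(batch: Any) -> None:
    """Hypothetical training step that does almost no work."""
    _ = batch


report = profile_loader(slow_batches(5), fast_step)
```

A high `wait_fraction` suggests the run is input-bound rather than GPU-bound, but it does not say which stage is responsible. Separating CPU preprocessing, storage, and network still requires timing those stages individually, for example by reading shards without decoding them.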

Practical task

Design a loader plan for image-text pretraining: shard layout, manifest, worker assignment, resume, shuffle, and failure handling.
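
One possible starting point for the worker-assignment, shuffle, and resume parts of such a plan is sketched below. It is not a reference design: the function names, the shard file names, the seed scheme, and the idea of checkpointing a per-worker count of finished shards are all assumptions made for illustration. It uses only the standard library and does not read any real tar files.

```python
import random
from typing import Dict, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def shards_for_worker(shards: List[str], epoch: int, seed: int,
                      rank: int, world_size: int,
                      worker_id: int, num_workers: int) -> List[str]:
    """Deterministically give each (rank, worker) a disjoint slice of shards."""
    order = list(shards)
    # Every process computes the same per-epoch shard order from (seed, epoch).
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    slot = rank * num_workers + worker_id
    total_slots = world_size * num_workers
    return order[slot::total_slots]


def remaining_shards(assigned: List[str], shards_done: int) -> List[str]:
    """Resume at shard granularity: skip shards this worker already finished."""
    return assigned[shards_done:]


def buffered_shuffle(samples: Iterable[T], buffer_size: int,
                     rng: random.Random) -> Iterator[T]:
    """Approximate sample-level shuffle over a sequential stream."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    buf: List[T] = []
    for sample in samples:
        if len(buf) < buffer_size:
            buf.append(sample)
            continue
        i = rng.randrange(buffer_size)
        yield buf[i]          # emit a random buffered sample...
        buf[i] = sample       # ...and put the new one in its place
    rng.shuffle(buf)
    yield from buf            # drain what is left at end of stream


# Hypothetical layout: 8 shards, 2 ranks, 2 loader workers per rank.
shards = [f"shard-{i:05d}.tar" for i in range(8)]
plan: Dict[Tuple[int, int], List[str]] = {
    (rank, worker): shards_for_worker(shards, epoch=0, seed=17,
                                      rank=rank, world_size=2,
                                      worker_id=worker, num_workers=2)
    for rank in range(2) for worker in range(2)
}
```

Shuffle quality here is bounded by the shard order plus the buffer size, which is the usual price of sequential reads. Resuming at shard granularity replays or skips part of an interrupted shard, so a real plan would need to decide which of those is acceptable. The sketch also leaves out failure handling (for example, logging and skipping an unreadable shard) and the case where the shard count does not divide evenly across workers, which can leave ranks with different sample counts.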

Source-grounded rule

Any throughput claim should state the storage backend, network, decode path, and number of workers used to obtain it.
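
One lightweight way to apply this rule is to record every throughput number together with its context, and refuse to build the record when a field is missing. The class below is a hypothetical sketch; the field names are assumptions, and the values in the example are placeholders rather than measurements.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ThroughputReport:
    """A throughput number plus the context needed to interpret it."""
    samples_per_s: float
    storage_backend: str   # e.g. local NVMe, object storage
    network: str           # e.g. link speed between storage and trainer
    decode_path: str       # e.g. CPU decode, GPU decode
    num_workers: int

    def __post_init__(self) -> None:
        # Reject claims that omit any piece of context.
        for name, value in asdict(self).items():
            if value is None or value == "":
                raise ValueError(f"missing context for throughput claim: {name}")
        if self.num_workers < 0:
            raise ValueError("num_workers must be non-negative")


# Placeholder values for illustration only, not a measured result.
example = ThroughputReport(
    samples_per_s=1234.0,
    storage_backend="object storage (placeholder)",
    network="placeholder link description",
    decode_path="CPU decode (placeholder)",
    num_workers=8,
)
```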