Streaming DataLoaders and Storage Formats
WebDataset, object storage, sharded tar files, HF streaming datasets, DALI/NVDEC, shuffle constraints and data-loader bottleneck debugging.
What the candidate should be able to do
- Explain why sharding and sequential reads reduce small-file overhead for large media datasets.
- Use streaming/iterable datasets when full materialization is impractical.
- Understand the trade-offs among random access, shuffle quality, resumability and throughput.
- Profile whether training is GPU-bound, CPU preprocessing-bound, storage-bound or network-bound.
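The small-file point can be made concrete with a minimal sketch that uses only Python's standard `tarfile` module. This is not the WebDataset library itself: `write_shard` and `read_shard` are hypothetical helpers, and the keys and payloads are invented. The sketch assumes the common convention that all files of one sample share a basename and are stored next to each other in the archive.

```python
import io
import tarfile


def write_shard(samples, fileobj):
    """Write (key, {ext: bytes}) samples into one tar stream.

    Files of one sample are written back to back as "<key>.<ext>".
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(fileobj):
    """Yield (key, {ext: bytes}) by scanning the tar front to back.

    Assumes keys contain no dots and that members of a sample are adjacent.
    """
    current_key, fields = None, {}
    # Mode "r|" is tarfile's streaming mode: it never seeks, so the same
    # loop works on a pipe or an HTTP response body, not only local files.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if current_key is not None and key != current_key:
                yield current_key, fields
                fields = {}
            current_key = key
            fields[ext] = tar.extractfile(member).read()
    if current_key is not None:
        yield current_key, fields


if __name__ == "__main__":
    buf = io.BytesIO()
    write_shard([("000000", {"jpg": b"fake-image", "txt": b"a cat"})], buf)
    buf.seek(0)
    for key, fields in read_shard(buf):
        print(key, sorted(fields))
```

With this layout one shard costs one open (or one object-store request) and one sequential read, instead of one request per image and caption. The price is that a sample in the middle of a shard cannot be fetched without reading what precedes it.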
What interviewers ask
- How would you feed petabyte-scale image/video data to distributed training?
- What breaks when an IterableDataset has no random access?
- How do shard size and shuffle buffer affect training?
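For the shuffle-buffer question, a small sketch can help show why a streaming shuffle is only approximate. `buffered_shuffle` below is a hypothetical helper written for illustration, not the API of any particular library. It assumes the usual scheme: keep a fixed-size buffer, emit a random element from it, and refill the freed slot from the stream.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=0):
    """Approximately shuffle an iterable using a bounded buffer."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buf.append(item)
            continue
        # Emit a random buffered item and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = item
    # Drain phase: the stream is exhausted, emit the rest in random order.
    rng.shuffle(buf)
    yield from buf


if __name__ == "__main__":
    print(list(buffered_shuffle(range(20), buffer_size=4, seed=0)))
```

In this scheme a sample can move earlier in the order by at most `buffer_size - 1` positions, so samples that are far apart in the stream are never mixed by the buffer alone. If each shard holds correlated data (one source, one class, one video), a buffer much smaller than a shard leaves that correlation in the batches. This is why shard order is usually shuffled as well, and why shard size and buffer size need to be chosen together.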
Practical task
Design a loader plan for image-text pretraining: shard layout, manifest, worker assignment, resume, shuffle and failure handling.
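One possible starting point for the worker-assignment and resume parts of the plan is sketched below. It is an illustration under stated assumptions, not a reference design: `assign_shards` and `remaining_after_resume` are hypothetical helpers, the shard names are invented, and resume is shard-granular (a partially read shard is replayed from its start).

```python
import random


def assign_shards(shards, epoch, rank, world_size, worker_id, num_workers, seed=0):
    """Return the shards one loader worker reads in a given epoch.

    Every process derives the same permutation from (seed, epoch), so no
    communication is needed to agree on a disjoint split.
    """
    order = list(shards)
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    global_worker = rank * num_workers + worker_id
    total_workers = world_size * num_workers
    # Strided slice: worker k takes items k, k + total, k + 2 * total, ...
    return order[global_worker::total_workers]


def remaining_after_resume(assigned, shards_done):
    """Skip shards this worker had fully consumed before the checkpoint."""
    return assigned[shards_done:]


if __name__ == "__main__":
    # Hypothetical manifest: 64 shard names.
    manifest = [f"shard-{i:05d}.tar" for i in range(64)]
    mine = assign_shards(manifest, epoch=0, rank=1, world_size=4,
                         worker_id=0, num_workers=2)
    print(mine)
    print(remaining_after_resume(mine, shards_done=3))
```

A full plan would still need to say what happens when the shard count does not divide evenly across workers (ranks that see different numbers of batches can stall synchronous training), how a corrupt or unreachable shard is skipped and logged, and whether the checkpoint stores per-worker progress or only the epoch.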
Source-grounded rule
Throughput claims should include storage backend, network, decode path and number of workers.
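A rough way to follow this rule is to record the context next to the number. The sketch below is a hypothetical harness, not a standard tool: `ThroughputReport` and `measure` are made-up names, and the context values in the usage example are placeholders. It splits wall time into time spent waiting for the next batch and time spent in the training step, which gives a first hint about whether the loader or the step is the bottleneck.

```python
import time
from dataclasses import dataclass


@dataclass
class ThroughputReport:
    samples_per_sec: float
    data_wait_fraction: float  # share of wall time spent waiting on the loader
    storage_backend: str
    network: str
    decode_path: str
    num_workers: int


def measure(loader, step_fn, batch_size, **context):
    """Time loader waits and step_fn separately over one pass of loader.

    On a GPU, step_fn has to synchronize the device before returning;
    otherwise asynchronous kernel launches make the step look nearly free.
    """
    wait = busy = 0.0
    batches = 0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        busy += t2 - t1
        batches += 1
    total = wait + busy
    return ThroughputReport(
        samples_per_sec=batches * batch_size / total if total > 0 else 0.0,
        data_wait_fraction=wait / total if total > 0 else 0.0,
        **context,
    )


if __name__ == "__main__":
    def slow_loader():
        # Stand-in for a storage-bound loader: 10 ms per batch.
        for i in range(20):
            time.sleep(0.01)
            yield i

    report = measure(slow_loader(), lambda batch: None, batch_size=32,
                     storage_backend="placeholder-backend",
                     network="placeholder-network",
                     decode_path="placeholder-decoder", num_workers=0)
    print(report)
```

A high `data_wait_fraction` only says that the input side is slow. Telling storage, network and CPU decode apart takes further experiments, such as rereading the same shards from local disk or running the loader with decoding disabled.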