Required

Streaming DataLoaders and Storage

WebDataset, object storage, tar shards, shuffle quality, DALI/NVDEC, prefetching and avoiding GPU starvation.

Study time: 28 min

Streaming DataLoaders and Storage Formats

WebDataset, object storage, sharded tar files, HF streaming datasets, DALI/NVDEC, shuffle constraints and data-loader bottleneck debugging.

What the candidate should be able to do

  • Explain why sharding and sequential reads reduce small-file overhead for large media datasets.
  • Use streaming/iterable datasets when full materialization is impractical.
  • Understand random access, shuffle quality, resumability and throughput trade-offs.
  • Profile whether training is GPU-bound, CPU preprocessing-bound, storage-bound or network-bound.
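The first two points can be illustrated with a minimal sketch that uses only the Python standard library. It packs samples into one tar shard and reads them back in a single forward pass, so many small files become one large sequential read. The naming scheme (members that share a basename form one sample) follows the convention WebDataset uses; the helper names `write_shard` and `stream_shard` are hypothetical, and the in-memory buffer stands in for a file or an object-storage response.

```python
import io
import tarfile


def write_shard(samples):
    """Pack samples into one in-memory tar shard.

    Each sample is (key, {extension: bytes}). Members are named key.extension,
    so all files of one sample sit next to each other in the archive.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def stream_shard(shard_bytes):
    """Yield (key, fields) samples from a shard in one forward pass."""
    stream = io.BytesIO(shard_bytes)
    # Mode "r|" treats the file object as a non-seekable stream, which is
    # how a socket or an HTTP response body behaves.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        current_key, current = None, {}
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                # A new basename starts: the previous sample is complete.
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
        if current:
            yield current_key, current
```

Because the reader never seeks, there is no way to jump to sample number N inside a shard. That is the random-access trade-off named in the third point: throughput improves, but shuffling and resuming have to work at shard granularity or through a buffer.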

What interviewers ask

  • How would you feed petabyte-scale image/video data to distributed training?
  • What breaks when IterableDataset has no random access?
  • How do shard size and shuffle buffer affect training?
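For the last question, a small sketch of a bounded shuffle buffer may help the discussion. It is written in plain Python and is not the implementation of any particular library; `buffered_shuffle` and `max_early_emission` are hypothetical names. The point it illustrates: with a buffer of size B, no sample can be emitted more than B - 1 positions ahead of where it arrived, so a buffer much smaller than a shard leaves samples from the same shard clustered together.

```python
import random


def buffered_shuffle(stream, buffer_size, rng):
    """Approximate shuffle of a forward-only stream using a bounded buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
            continue
        # Buffer is full: emit a random buffered item, keep the new one.
        idx = rng.randrange(len(buf))
        yield buf[idx]
        buf[idx] = item
    # Stream exhausted: drain what is left in random order.
    rng.shuffle(buf)
    yield from buf


def max_early_emission(order):
    """Largest number of positions any item moved ahead of its arrival slot."""
    return max(original - emitted for emitted, original in enumerate(order))
```

With `buffer_size=1` the output order equals the input order. A common complement, assumed here rather than shown, is to also shuffle the order of shards each epoch so that the buffer mixes samples drawn from different shards.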

Practical task

Design a loader plan for image-text pretraining: shard layout, manifest, worker assignment, resume, shuffle and failure handling.
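One possible starting point for the worker-assignment and resume parts of such a plan is sketched below. It assumes shard-granular resume (a partially read shard is replayed from its start), a fixed world size and worker count between save and restore, and a hypothetical checkpoint dict of the form `{"epoch": int, "shards_done": int}`. The function names and the shard file names are illustrative, not taken from any library.

```python
import random


def epoch_shard_order(shards, seed, epoch):
    """Same permutation on every rank and worker for a given (seed, epoch)."""
    order = list(shards)
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def shards_for_worker(shards, seed, epoch, rank, world_size, worker_id, num_workers):
    """Disjoint slice of the epoch's shard order for one (rank, worker) pair."""
    order = epoch_shard_order(shards, seed, epoch)
    slot = rank * num_workers + worker_id
    total_slots = world_size * num_workers
    # Strided slicing gives every slot a disjoint subset that covers all shards.
    return order[slot::total_slots]


def resume(shards, seed, state, rank, world_size, worker_id, num_workers):
    """Shards still to read after restoring the hypothetical checkpoint state.

    state["shards_done"] counts shards this (rank, worker) fully consumed.
    """
    mine = shards_for_worker(
        shards, seed, state["epoch"], rank, world_size, worker_id, num_workers
    )
    return mine[state["shards_done"]:]
```

Because the order is derived only from `seed` and `epoch`, no process needs to communicate its assignment to the others. A full plan would still have to cover the remaining items in the task: a manifest with per-shard sample counts, sample-level shuffling, and what to do when a shard is missing or corrupt.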

Source-grounded rule

Throughput claims should state the storage backend, network, decode path and number of workers.
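One way to make this rule hard to skip is to record the context together with the number. The sketch below is an assumption about how such a record could look, not an established format: `LoaderBenchmark` and `benchmark_loader` are hypothetical names. It drains a loader with no model attached, which gives a rough upper bound on what the input pipeline can deliver and a baseline for deciding whether training is limited by the GPU or by the data path.

```python
import time
from dataclasses import dataclass


@dataclass
class LoaderBenchmark:
    storage_backend: str
    network: str
    decode_path: str
    num_workers: int
    samples: int
    seconds: float

    @property
    def samples_per_second(self):
        return self.samples / self.seconds if self.seconds > 0 else float("nan")

    def summary(self):
        # The throughput number is never printed without its context.
        return (
            f"{self.samples_per_second:.1f} samples/s "
            f"[storage={self.storage_backend}, network={self.network}, "
            f"decode={self.decode_path}, workers={self.num_workers}]"
        )


def benchmark_loader(batches, batch_size_of, **context):
    """Drain an iterable of batches with no model attached and time it."""
    start = time.perf_counter()
    samples = 0
    for batch in batches:
        samples += batch_size_of(batch)
    seconds = time.perf_counter() - start
    return LoaderBenchmark(samples=samples, seconds=seconds, **context)
```

A call might look like `benchmark_loader(loader, len, storage_backend="local NVMe", network="none", decode_path="CPU JPEG decode", num_workers=8)`, where every value is a placeholder to be replaced with the actual setup.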