Required

Foundation Model Data Pipelines

Collection, licensing, preprocessing, metadata, captioning, filtering and reproducible dataset versions for large models.

Study time: 30 min

End-to-end architecture for collecting, normalizing, documenting, versioning, filtering, sharding and serving large-scale pretraining data.

What the candidate should be able to do

  • Explain the ingestion, extraction, filtering, deduplication, documentation, sharding, loading and monitoring stages (a minimal dedup stage is sketched below).
  • Treat quality, diversity, safety, legality and throughput as separate concerns.
  • Design a provenance-aware manifest that records source, license, modality, applied filters and known limitations (see the manifest sketch below).
  • Use model evaluation feedback to improve data curation instead of guessing.
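
A minimal sketch of the deduplication stage mentioned above, assuming documents arrive as plain-text strings. The normalization rule and the in-memory "seen" set are illustrative; production pipelines typically add near-duplicate detection (e.g. MinHash) on top of exact hashing:

```python
import hashlib
from typing import Iterable, Iterator

def normalize(text: str) -> str:
    """Illustrative normalization: lowercase and collapse whitespace
    so trivially reformatted copies hash identically."""
    return " ".join(text.lower().split())

def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Exact-dedup stage: drop documents whose normalized SHA-256
    digest has been seen before. The in-memory set is a sketch;
    a real pipeline would use a disk-backed or distributed store."""
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest in seen:
            continue  # duplicate: filter out
        seen.add(digest)
        yield doc

# Usage: the reformatted copy is dropped.
docs = ["Hello  World", "hello world", "Another document"]
print(list(dedup_exact(docs)))  # ['Hello  World', 'Another document']
```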

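One way to make the provenance-aware manifest concrete, as a per-source record. This is a sketch with hypothetical field names and values; the real schema would follow your storage layer and legal review:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SourceManifestEntry:
    """Provenance record for one data source; all field names are illustrative."""
    source_id: str        # stable identifier for the source
    url_or_path: str      # where the raw data was obtained
    license: str          # e.g. an SPDX identifier, or "unknown"
    modality: str         # "text", "image", "audio", ...
    snapshot_date: str    # when the source was collected (ISO 8601)
    filters_applied: list[str] = field(default_factory=list)
    known_limitations: list[str] = field(default_factory=list)

entry = SourceManifestEntry(
    source_id="commoncrawl-2023-50",
    url_or_path="s3://bucket/commoncrawl/2023-50/",  # hypothetical path
    license="unknown",
    modality="text",
    snapshot_date="2023-12-01",
    filters_applied=["language_id>=0.9", "exact_dedup"],
    known_limitations=["boilerplate residue", "English-skewed"],
)
print(json.dumps(asdict(entry), indent=2))
```
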
What interviewers ask

  • How would you design a reproducible pretraining data pipeline from web sources?
  • What metadata would you persist per sample, and why?
  • How do you decide whether a data-quality improvement is real? (One approach is sketched below.)
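
For the last question, one common approach is to train matched models on the old and new data recipe, evaluate both on the same held-out tasks, and bootstrap the paired score differences rather than trusting a single mean. A sketch with placeholder scores:

```python
import random

# Per-task eval scores for models trained on the old vs. new data recipe
# (placeholder numbers; real scores come from your eval harness).
baseline  = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.63, 0.52]
candidate = [0.64, 0.58, 0.70, 0.53, 0.69, 0.60, 0.66, 0.55]

def paired_bootstrap_ci(a, b, iters=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean paired difference b - a. If the interval
    excludes 0, the gain is less likely to be resampling noise; it says
    nothing about confounders such as tokenizer or hyperparameter changes."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(iters)
    )
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

print(paired_bootstrap_ci(baseline, candidate))
```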

Practical exercise

Draft a pipeline spec for a 10B-token corpus: source manifest, extraction and filtering rules, deduplication strategy, dataset card fields and validation metrics. A starting skeleton follows.
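
A possible skeleton for that spec, written as a plain Python dict so it can be linted and versioned alongside the code. Every name, threshold and metric below is an assumption to adapt, not a reference design:

```python
# Hypothetical spec skeleton for the 10B-token exercise;
# all values are placeholders.
PIPELINE_SPEC = {
    "target_tokens": 10_000_000_000,
    "source_manifest": [
        {"source_id": "web-crawl-A", "license": "unknown", "modality": "text"},
        {"source_id": "permissive-code", "license": "apache-2.0", "modality": "text"},
    ],
    "extraction": {"html_to_text": True, "min_chars": 200},
    "filtering": {
        "language_id_threshold": 0.9,   # placeholder cutoff
        "perplexity_filter": "optional",
        "safety_blocklists": ["pii", "known-bad-domains"],
    },
    "dedup": {"exact": "sha256", "near": "minhash@0.8"},
    "dataset_card": [
        "intended_use", "sources_and_licenses", "filters_applied",
        "known_limitations", "versioning_scheme",
    ],
    "validation_metrics": [
        "duplicate_rate", "language_mix", "token_count_per_source",
        "downstream_eval_delta_vs_previous_version",
    ],
}
```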

Source-grounded rule

Claims about dataset size and quality should cite papers or reports; claims about data legality or safety require cautious wording.

Materials