Foundation Model Data Pipelines
End-to-end architecture for collecting, normalizing, documenting, versioning, filtering, sharding and serving large-scale pretraining data.
What the candidate should be able to do
- Explain the ingestion, extraction, filtering, deduplication, documentation, sharding, loading, and monitoring stages.
- Treat quality, diversity, safety, legality, and throughput as separate concerns.
- Design a provenance-aware manifest with source, license, modality, applied filters, and known limitations (see the manifest sketch after this list).
- Use model evaluation feedback to improve data curation instead of guessing.
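A minimal sketch of such a manifest record, under stated assumptions: the class and field names (ManifestEntry, filters_applied, known_limitations) are illustrative, not a standard schema.

```python
# Sketch of one provenance-aware manifest entry serialized as JSON.
# Field names are hypothetical; adapt them to your own sources and tooling.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ManifestEntry:
    source: str                # e.g. a crawl snapshot or dump identifier (illustrative)
    license: str               # license tag as recorded at collection time
    modality: str              # "text", "code", "image", ...
    filters_applied: list[str] = field(default_factory=list)  # ordered filter names for reproducibility
    known_limitations: str = ""                                # free-text caveats surfaced in the dataset card

entry = ManifestEntry(
    source="example-web-snapshot",
    license="unknown-web",
    modality="text",
    filters_applied=["language-id", "quality-classifier", "near-dedup"],
    known_limitations="boilerplate removal imperfect; license inferred, not verified",
)
print(json.dumps(asdict(entry), indent=2))
```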
What interviewers ask
- How would you design a reproducible pretraining data pipeline from web sources?
- What metadata would you persist per sample, and why? (See the metadata sketch after this list.)
- How do you decide whether a data-quality improvement is real?
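A hedged per-sample metadata sketch, assuming a JSONL sidecar layout; field names such as content_sha256 and filters_passed are illustrative rather than a fixed standard.

```python
# Sketch: persist one metadata record per document in a JSONL sidecar file.
# Each field maps to a reason for keeping it (dedup, provenance, legality, reproducibility).
import hashlib
import json
import time

def sample_record(text: str, source_url: str, license_tag: str) -> dict:
    return {
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),  # dedup key and integrity check
        "source_url": source_url,          # provenance for audits and takedown requests
        "license": license_tag,            # legality tracking (often "unknown" for web data)
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # snapshot versioning
        "char_len": len(text),             # cheap size/quality signal
        "filters_passed": [],              # filled in as filters run, so decisions stay reproducible
    }

with open("samples.meta.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample_record("Hello world", "https://example.com", "unknown")) + "\n")
```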
Practical task
Draft a pipeline spec for a 10B-token corpus: source manifest, extraction/filtering, dedup, dataset card fields, and validation metrics.
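A minimal dedup-and-validation sketch under stated assumptions: exact SHA-256 deduplication, a crude 4-characters-per-token heuristic, and a JSONL corpus with a "text" field. Real pipelines typically add near-duplicate detection (e.g. MinHash) and tokenizer-based counts.

```python
# Sketch: exact dedup over a JSONL corpus plus two validation metrics
# (duplicate rate and an approximate token count).
import hashlib
import json

def dedup_and_validate(in_path: str, out_path: str) -> dict:
    seen: set[str] = set()
    kept = dropped = chars = 0
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            h = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
            if h in seen:           # exact duplicate: drop it
                dropped += 1
                continue
            seen.add(h)
            kept += 1
            chars += len(doc["text"])
            fout.write(line)
    return {
        "kept_docs": kept,
        "duplicate_rate": dropped / max(kept + dropped, 1),
        "approx_tokens": chars // 4,  # crude heuristic; replace with the actual tokenizer for the spec
    }
```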
Source-grounded rule
Dataset size and quality claims should cite papers or reports; claims about data legality and safety require cautious wording.