Production ML question

A speech product collects user audio. How would you filter and route audio snippets for ASR/TTS training data without poisoning the dataset?

Answer it yourself

First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.

Build a staged pipeline: consent/privacy checks, language and speech-quality detection, VAD/diarization, ASR confidence or human QC, deduplication, domain balancing and versioned dataset promotion.
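As a rough illustration of the staged idea, the sketch below runs a clip through ordered gates and records why it was stopped. The `Clip` fields, the gate names, the allowed languages, and the thresholds (1 to 30 seconds, confidence 0.85) are all hypothetical placeholders, not values from any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Clip:
    # Hypothetical metadata assumed to be computed by upstream services.
    clip_id: str
    has_consent: bool
    duration_s: float
    language: str
    asr_confidence: float
    rejections: List[str] = field(default_factory=list)


# A gate returns None when the clip passes, or a short rejection reason.
Gate = Callable[[Clip], Optional[str]]


def consent_gate(clip: Clip) -> Optional[str]:
    return None if clip.has_consent else "no_consent"


def duration_gate(clip: Clip) -> Optional[str]:
    # Assumed limits; tune per product.
    return None if 1.0 <= clip.duration_s <= 30.0 else "bad_duration"


def language_gate(clip: Clip) -> Optional[str]:
    # Assumed set of supported languages.
    return None if clip.language in {"en", "ru"} else "unsupported_language"


def confidence_gate(clip: Clip) -> Optional[str]:
    # Assumed threshold; tune per product.
    return None if clip.asr_confidence >= 0.85 else "low_asr_confidence"


# Consent comes first so that disallowed audio never reaches later stages.
GATES: List[Gate] = [consent_gate, duration_gate, language_gate, confidence_gate]


def route(clip: Clip, gates: List[Gate] = GATES) -> str:
    """Return the destination bucket for a clip and store the rejection reason."""
    for gate in gates:
        reason = gate(clip)
        if reason is not None:
            clip.rejections.append(reason)
            # Uncertain transcripts go to human QC instead of being dropped.
            return "human_review" if reason == "low_asr_confidence" else "rejected"
    return "train_candidate"
```

Stopping at the first failed gate keeps the cheap policy checks ahead of the expensive ones, and the stored reason makes later audits of the rejected pool possible.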

Full walkthrough

Start with product and privacy constraints. Only audio allowed by consent and policy should enter the pool of training candidates. Strip or protect personal data where required, track provenance, and keep the data-retention policy independent of what would be convenient for model training.

Then add quality gates: voice activity detection, duration limits, noise/SNR checks, clipping detection, language ID, speaker/channel metadata and ASR confidence. For TTS, speaker quality and consistency matter; for ASR, diverse acoustic conditions and accurate transcripts matter. Mislabeled clips can hurt a model more than having less data.

Finally, route the data into versioned datasets. Deduplicate near-identical clips, balance by language/domain/accent/device, sample for human review, and store rejection reasons. Promotion from raw audio to train-ready data should be reproducible, so that any model can be traced back to the dataset version and filters that produced it.
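The deduplication and traceability points can be sketched as follows. This is a minimal example under stated assumptions: clips are plain dicts with hypothetical `clip_id`, `speaker_id` and `transcript` keys, and duplicates are detected by an exact match on a normalized transcript per speaker, which is only a crude stand-in for real near-duplicate detection (audio fingerprints or embeddings would be the usual choice). The version ID is a hash over the filter configuration and the sorted clip IDs.

```python
import hashlib
import json
import re


def dedup_key(speaker_id: str, transcript: str) -> str:
    # Lowercase, drop punctuation, collapse whitespace (assumed normalization).
    norm = re.sub(r"[^\w\s]", "", transcript.lower())
    norm = " ".join(norm.split())
    return f"{speaker_id}|{norm}"


def deduplicate(clips):
    """Keep the first clip for each (speaker, normalized transcript) key."""
    seen = set()
    kept = []
    for clip in clips:
        key = dedup_key(clip["speaker_id"], clip["transcript"])
        if key not in seen:
            seen.add(key)
            kept.append(clip)
    return kept


def dataset_version(clips, filter_config) -> str:
    """Derive a deterministic version ID from the filters and the clip IDs."""
    payload = {
        "filters": filter_config,
        # Sorting makes the ID independent of processing order.
        "clip_ids": sorted(c["clip_id"] for c in clips),
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
```

Storing the returned ID next to each trained checkpoint is one way to answer "which clips and which thresholds produced this model" later; changing any threshold or any clip in the set yields a different ID.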