Question · Hard · speech-ml · ML System Design in a technical interview · Chinor Chinor

ASR for a low-resource language when Whisper falls short

Short answer

Collect in-domain transcriptions, fine-tune the ASR model on the target language and domain, and preserve timestamps. If transcript quality remains too low, optionally train the downstream extraction model directly on audio.

Full breakdown

If the generic ASR model fails on the target language, the first useful investment is in-domain data. Sample real calls across operators, branches, noise conditions, and call outcomes. Ask native-speaking annotators to transcribe the speech and, if needed, mark the time intervals where the branch and appointment time are decided. Fine-tune an ASR model on these transcripts rather than trying to solve everything with a downstream LLM. The extraction model depends on dates, addresses, and confirmations, so systematic ASR errors in those entities will dominate its mistakes.

If full transcription is expensive, use a staged annotation strategy: first label the outcome and final booking fields from the operator table, then add transcript or timestamp labels only for confusing cases. Keep the business labels and the ASR transcript labels separate, because they train different stages.
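The sampling step described above could be sketched as follows. This is a minimal illustration, not a prescribed pipeline: the field names (`operator`, `branch`, `noise`, `outcome`) and the helper `stratified_sample` are hypothetical, and it assumes call metadata is already available as a list of dictionaries.

```python
import random
from collections import defaultdict


def stratified_sample(calls, keys, per_stratum, seed=0):
    """Pick up to `per_stratum` calls from every combination of `keys`.

    `calls` is a list of dicts; `keys` names the metadata fields that
    define a stratum (hypothetical names, adapt to the real call table).
    """
    rng = random.Random(seed)

    # Group calls by their stratum, e.g. ("op_7", "north", "noisy", "booked").
    strata = defaultdict(list)
    for call in calls:
        strata[tuple(call[k] for k in keys)].append(call)

    # Draw the same quota from each stratum so rare conditions are not
    # drowned out by the most common operator or noise profile.
    sample = []
    for stratum in sorted(strata):
        group = strata[stratum]
        rng.shuffle(group)
        sample.extend(group[:per_stratum])
    return sample


# Illustrative metadata only; real values would come from the call logs.
example_calls = [
    {"id": 1, "operator": "op_1", "branch": "north", "noise": "clean", "outcome": "booked"},
    {"id": 2, "operator": "op_1", "branch": "north", "noise": "noisy", "outcome": "booked"},
    {"id": 3, "operator": "op_2", "branch": "south", "noise": "clean", "outcome": "no_booking"},
]
to_annotate = stratified_sample(
    example_calls, keys=["branch", "noise", "outcome"], per_stratum=1
)
```

An equal quota per stratum deliberately over-represents rare conditions compared with uniform random sampling, which is usually what an annotation budget for ASR fine-tuning needs.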

Theory

For low-resource speech systems, domain and language data quality often matters more than the downstream extractor architecture.

Common mistakes

Assuming multilingual Whisper is good enough without measuring entity errors.
Labeling only the final booking fields and expecting ASR to improve.
Asking operators to re-listen to every call as part of their normal workflow.
Ignoring language and noise stratification when sampling.
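To make the first mistake concrete, here is a small sketch of measuring entity errors alongside word error rate. It is an assumption-laden illustration: `word_error_rate` is a plain word-level edit distance, and `entity_error_rate` is a hypothetical, deliberately crude metric that only checks whether each reference entity string appears verbatim in the hypothesis.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def entity_error_rate(entities, hypothesis):
    """Share of reference entity strings missing from the hypothesis.

    Hypothetical metric: exact, case-insensitive, whole-word matching only.
    """
    if not entities:
        return 0.0
    hyp = " " + " ".join(hypothesis.lower().split()) + " "
    missed = sum(
        1 for e in entities
        if " " + " ".join(e.lower().split()) + " " not in hyp
    )
    return missed / len(entities)


reference = "book me on may fifth at main street"
hypothesis = "book me on may fifteenth at main street"
wer = word_error_rate(reference, hypothesis)
eer = entity_error_rate(["may fifth", "main street"], hypothesis)
```

In this toy pair a single substituted word gives a WER of 0.125, yet one of the two booking entities is wrong, so the entity error rate is 0.5. A transcript can look acceptable on WER while still breaking the downstream extraction.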

How to answer in the interview

Say you would fine-tune ASR on in-domain calls.
Separate ASR labels from final booking labels.
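One way to keep the two label types apart is to give each its own record type, joined only by a call identifier. The sketch below is illustrative: every class, field, and function name is hypothetical. It also assumes that "confusing cases" means calls where the current pipeline's predicted booking fields disagree with the operator table.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    start_s: float  # segment start in seconds, kept so timestamps survive
    end_s: float
    text: str       # annotator transcript for this span


@dataclass
class AsrLabel:
    """Transcript labels: used only to fine-tune the ASR stage."""
    call_id: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class BookingLabel:
    """Business labels from the operator table: used only for extraction."""
    call_id: str
    outcome: str
    branch: Optional[str] = None
    slot: Optional[str] = None


def asr_training_pairs(asr_labels):
    """Flatten transcripts into (call_id, start, end, text) rows."""
    return [
        (label.call_id, seg.start_s, seg.end_s, seg.text)
        for label in asr_labels
        for seg in label.segments
    ]


def transcription_queue(booking_labels, predictions):
    """Return call ids whose predicted fields disagree with the operator table.

    `predictions` maps call_id to a BookingLabel produced by the current
    pipeline; calls with no prediction are queued as well.
    """
    queue = []
    for truth in booking_labels:
        pred = predictions.get(truth.call_id)
        if pred is None or (pred.outcome, pred.branch, pred.slot) != (
            truth.outcome, truth.branch, truth.slot
        ):
            queue.append(truth.call_id)
    return queue
```

With this split, cheap booking labels can be collected for every call, while the more expensive transcript labels are requested only for the calls that `transcription_queue` returns.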