Восстановление пунктуации и капитализации в ASR-тексте
Восстановление пунктуации и капитализации в ASR-тексте
Ответить самому
Сначала сформулируйте ответ как на собеседовании, затем откройте разбор и оцените себя.
Короткий ответ
Frame it as token-level sequence labeling: for each word predict capitalization and punctuation-after-word classes. This preserves ASR words and avoids generative hallucination.
Полный разбор
A simple baseline is prompting an LLM to rewrite the text, but that can change words, be expensive and make output harder to constrain. A better production framing is sequence labeling over the original ASR tokens.
For every word, predict two labels: capitalization class for the word and punctuation class after the word. Capitalization can be binary or richer if title case/acronyms matter. Punctuation can be none, comma, period, question mark, colon and other supported symbols. A shared Transformer encoder can produce contextual token representations and two classifier heads.
Training data can be created by taking clean punctuated text, normalizing it to lowercase words without punctuation as input, and using the original punctuation/case as labels. LLMs or human review can help bootstrap domain-specific data, but the model should be evaluated on real ASR-like text because ASR errors and spoken language differ from clean written corpora.
Теория
When the task is to annotate an existing sequence, sequence labeling is safer and cheaper than free-form generation.
Типичные ошибки
- Let a generative model rewrite words when only punctuation/case should change.
- Ignore ASR-specific errors and train only on clean written text.
- Treat punctuation and capitalization as independent without shared context.
- Forget acronyms, names and domain-specific terms.
Как отвечать на собеседовании
- Say explicitly that the output must preserve original words.
- Use two heads: punctuation and capitalization.