Назад к подготовке

Токенизация и BERT-style разметка против autoregressive rewriting

Токенизация и BERT-style разметка против autoregressive rewriting

Ответить самому

Сначала сформулируйте ответ как на собеседовании, затем откройте разбор и оцените себя.

Загрузка

Короткий ответ

BERT-style labeling sees both left and right context in one pass and predicts constrained labels, while autoregressive rewriting may change words. Tokenization should align labels to words or word starts.

Полный разбор

Autoregressive generation is natural for text rewriting, but this task is constrained: preserve words and only add punctuation/case. Generation can hallucinate, correct ASR words that should remain unchanged, or make the decoding process expensive.

A BERT-style encoder runs over the whole sequence once and predicts labels for each word or word boundary. This uses bidirectional context, which is important for punctuation, names and sentence boundaries. It also keeps the output constrained to a small class set.

Tokenization is the main detail. If using subword BPE, align labels to the first subtoken of each word and mask the rest, or use a tokenizer/preprocessing scheme that keeps word boundaries explicit. Avoid relying on tokens that merge spaces, punctuation and word pieces in ways that make capitalization labels ambiguous.

Теория

BERT-like encoders are well suited to token classification because every position can use full context without decoding a new sequence.

Типичные ошибки

  • Put punctuation labels on arbitrary subword pieces.
  • Ignore how spaces are encoded in common BPE tokenizers.
  • Use autoregressive output without checking word preservation.
  • Assume capitalization is always sentence-initial and forget named entities.

Как отвечать на собеседовании

  • Mention label masking for non-first subtokens.
  • Frame hallucination as a product bug, not just a model detail.