Debugging the gap between offline evaluation and production quality

Short answer

First verify serving parity. Then inspect the complaint examples, compare them to validation slices, and look for domain shift, ASR errors, unseen terminology, or a metric that does not capture what users notice.

Full walkthrough

Start with engineering parity: the model version, tokenizer, preprocessing, normalization, thresholds and postprocessing in serving must match the ones used in validation. Reproduce the exact user examples offline and check whether the served model and the offline model produce the same output. If they differ, the problem is a serving bug, not a modeling one.
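The parity check can be sketched in a few lines. This is a minimal illustration, not a real serving client: `offline_model` and `served_model` are hypothetical toy stand-ins for the two inference paths, and the lowercasing in the served path is an invented preprocessing mismatch.

```python
from typing import Callable, Iterable


def find_parity_mismatches(
    examples: Iterable[str],
    offline_predict: Callable[[str], str],
    served_predict: Callable[[str], str],
) -> list[tuple[str, str, str]]:
    """Return (input, offline_output, served_output) for every disagreement."""
    mismatches = []
    for text in examples:
        offline_out = offline_predict(text)
        served_out = served_predict(text)
        if offline_out != served_out:
            mismatches.append((text, offline_out, served_out))
    return mismatches


# Hypothetical stand-ins: the served path lowercases its input (an assumed
# preprocessing bug), the offline path does not.
def offline_model(text: str) -> str:
    return text.strip() + "."


def served_model(text: str) -> str:
    return text.strip().lower() + "."


complaints = ["hello world", "Meet me at NASA"]
for text, off, srv in find_parity_mismatches(complaints, offline_model, served_model):
    print(f"input={text!r} offline={off!r} served={srv!r}")
```

Any mismatch points to a difference in the serving pipeline. An empty result suggests the cause lies in the data or the metric instead.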

Then analyze the data. Complaints may come from domains missing from the validation set, or involve new terms, named entities, acronyms, speech disfluencies, ASR substitutions, long contexts or non-standard punctuation style. Build slices from the complaint examples and compare their metrics to those of the aggregate validation set.
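A slice comparison along these lines might look as follows. The record keys (`slice`, `pred`, `gold`), the slice names and the toy data are assumptions made for illustration. Exact-match accuracy is used only as a placeholder for whatever metric the task actually needs.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute exact-match accuracy per slice, plus an aggregate "ALL" row.

    Each record is a dict with the hypothetical keys "slice", "pred", "gold".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        correct = int(record["pred"] == record["gold"])
        for key in (record["slice"], "ALL"):
            totals[key] += 1
            hits[key] += correct
    return {key: hits[key] / totals[key] for key in totals}


# Toy data: the validation slice looks perfect, the complaint slice does not.
records = [
    {"slice": "validation", "pred": "a", "gold": "a"},
    {"slice": "validation", "pred": "b", "gold": "b"},
    {"slice": "validation", "pred": "c", "gold": "c"},
    {"slice": "complaints:acronyms", "pred": "x", "gold": "y"},
    {"slice": "complaints:acronyms", "pred": "z", "gold": "z"},
]
scores = accuracy_by_slice(records)
for name, value in sorted(scores.items()):
    print(f"{name}: {value:.2f}")
```

A large gap between a complaint slice and the aggregate suggests the validation set under-represents that kind of input.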

Finally, question the metric. A high token-level score can coexist with unreadable sentences when sentence-boundary errors are rare but severe: they barely move the average yet ruin the text for the reader. Add production monitoring, sampled human review, complaint tagging and active-learning loops that feed hard cases into the next validation and training sets.
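A small worked example makes the metric point concrete. It assumes, purely for illustration, a tagging setup where each token is labeled `B` if a sentence boundary follows it and `O` otherwise. The label scheme and the toy sequences are not taken from any particular system.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def boundary_f1(gold, pred, label="B"):
    """F1 computed only over the boundary label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 20 tokens, two sentence boundaries in the gold labels.
gold = ["O"] * 20
gold[9] = "B"
gold[19] = "B"

# The model misses one of the two boundaries and is right everywhere else.
pred = list(gold)
pred[9] = "O"

print(f"token accuracy: {token_accuracy(gold, pred):.2f}")  # 0.95
print(f"boundary F1:    {boundary_f1(gold, pred):.2f}")     # 0.67
```

One missed boundary out of two merges two sentences into one, yet token accuracy still reads 95%. Reporting a metric restricted to the rare, high-impact label exposes the problem.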

Theory

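An offline metric validates a model only against one data distribution and one metric definition. When the model fails online despite good offline numbers, the usual cause is that the validation distribution or the metric did not match what users actually experience.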

Common mistakes

  • Assuming complaints are noise because validation accuracy is high.
  • Skipping tokenizer parity checks between serving and validation.
  • Retraining blindly without labeling the failed examples.
  • Ignoring that user complaints may target rare but severe errors.

How to answer in an interview

  • Say “reproduce the exact online example offline” early.
  • Separate serving bugs from data/metric problems.