
Metrics question

A human reviewer and an automatic checker each output a list of found errors. How do you evaluate the checker?


First formulate your answer as you would in an interview, then read the breakdown and grade yourself.


Short answer

Evaluate both decision quality and error-list quality. For decisions: an accept/reject confusion matrix, false-accept rate, false-reject rate and manual-review rate. For error lists: precision, recall and F1 over matched error objects. On the product side, also track the customer complaint rate.

Full breakdown

There are two levels. At the decision level, compare accept/reject/manual-review against human ground truth and track false accepts, false rejects and manual-review share. False accepts are often more costly because bad work reaches a customer; false rejects increase cost and delay.
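The decision-level rates can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the label strings "accept", "reject" and "manual", the function name decision_metrics, and the choice of denominators (false accepts over all items the human rejected, false rejects over all items the human accepted) are assumptions made for the example.

```python
from collections import Counter


def decision_metrics(human, checker):
    """Compare checker decisions against human ground truth.

    human:   list of "accept" / "reject" (hypothetical label names)
    checker: list of "accept" / "reject" / "manual"
    """
    if len(human) != len(checker):
        raise ValueError("human and checker must have the same length")

    # Confusion counts keyed by (human_label, checker_label).
    counts = Counter(zip(human, checker))
    total = len(human)
    truly_bad = sum(n for (h, _), n in counts.items() if h == "reject")
    truly_good = sum(n for (h, _), n in counts.items() if h == "accept")
    manual = sum(n for (_, c), n in counts.items() if c == "manual")

    false_accepts = counts[("reject", "accept")]  # bad work auto-accepted
    false_rejects = counts[("accept", "reject")]  # good work auto-rejected

    return {
        "confusion": dict(counts),
        "false_accept_rate": false_accepts / truly_bad if truly_bad else 0.0,
        "false_reject_rate": false_rejects / truly_good if truly_good else 0.0,
        "manual_review_rate": manual / total if total else 0.0,
    }
```

Reporting the two error rates separately, rather than a single accuracy number, keeps the asymmetric cost of false accepts visible.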

At the error-list level, treat human and model outputs as sets of structured errors. Match errors by type plus location/evidence with a tolerance rule. Then compute precision, recall and F1 for error detection. If location is unavailable, evaluate type-level recall separately from exact-location recall.
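One possible shape for the matching step is shown below. It is a sketch under stated assumptions: each error is reduced to a type plus an integer position (for example a line number), the tolerance is a fixed window, and matching is greedy and one-to-one. The names Error, match_errors and precision_recall_f1 are hypothetical. Greedy matching is not guaranteed to find the maximum number of matches; an optimal bipartite matching could be substituted if that matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Error:
    kind: str      # error type, e.g. "typo" (hypothetical taxonomy)
    position: int  # location, e.g. line number or character offset


def match_errors(human, checker, tolerance=2):
    """Count greedy one-to-one matches: same kind, positions within tolerance."""
    unmatched = list(human)
    matched = 0
    for pred in checker:
        candidates = [
            h for h in unmatched
            if h.kind == pred.kind and abs(h.position - pred.position) <= tolerance
        ]
        if candidates:
            # Take the closest human error and consume it so that it
            # cannot be matched a second time.
            best = min(candidates, key=lambda h: abs(h.position - pred.position))
            unmatched.remove(best)
            matched += 1
    return matched


def precision_recall_f1(human, checker, tolerance=2):
    """Precision, recall and F1 of checker errors against human errors."""
    tp = match_errors(human, checker, tolerance)
    precision = tp / len(checker) if checker else 0.0
    recall = tp / len(human) if human else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Type-level recall, for the case where location is unavailable, can be approximated with the same function by passing a very large tolerance.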

Add product metrics: customer complaints after auto-accept, reviewer override rate, time saved, cost per task, latency and drift by task category. A checker that finds many errors but sends everything to manual review may be accurate but not useful.
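A per-category breakdown of a few of these product metrics might look like the following sketch. The record layout (dicts with keys "category", "decision", "complaint" and "overridden") and the function name summarize_by_category are assumptions for illustration; a real pipeline would read these fields from its own logs.

```python
from collections import defaultdict


def summarize_by_category(records):
    """Per-category product metrics from hypothetical task records.

    Each record is a dict with assumed keys:
      "category":   task category name
      "decision":   "accept", "reject" or "manual"
      "complaint":  True if a customer complained afterwards
      "overridden": True if a reviewer overrode the checker
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["category"]].append(record)

    summary = {}
    for category, rows in groups.items():
        auto_accepted = [r for r in rows if r["decision"] == "accept"]
        complaints = sum(r["complaint"] for r in auto_accepted)
        summary[category] = {
            "manual_review_rate": sum(r["decision"] == "manual" for r in rows) / len(rows),
            "override_rate": sum(r["overridden"] for r in rows) / len(rows),
            # Complaints are measured only over auto-accepted tasks.
            "complaint_rate_after_auto_accept": (
                complaints / len(auto_accepted) if auto_accepted else 0.0
            ),
        }
    return summary
```

A category whose manual-review rate is close to 1.0 signals a checker that adds little value there, whatever its accuracy, and comparing these summaries across time windows is one way to watch for drift.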

Common mistakes

  • Reporting only overall accuracy and ignoring the cost of false accepts.
  • Comparing free-text explanations by string equality.
  • Forgetting the manual-review budget constraint.