Metrics question
A human reviewer and an automatic checker each output a list of found errors. How do you evaluate the checker?
Answer it yourself
First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.
Short answer
Evaluate both decision quality and error-list quality: accept/reject confusion matrix, precision/recall/F1 over matched error objects, false-accept rate, false-reject rate, manual-review rate and customer complaint rate.
Full walkthrough
There are two levels. At the decision level, compare accept/reject/manual-review against human ground truth and track false accepts, false rejects and manual-review share. False accepts are often more costly because bad work reaches a customer; false rejects increase cost and delay.
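The decision-level rates can be sketched as follows. This is a minimal illustration, not a fixed definition: it assumes the human label is either "accept" or "reject", the checker may also output "manual_review", and each rate is normalized by the corresponding human class. The function name `decision_metrics` and the label strings are hypothetical.

```python
from collections import Counter


def decision_metrics(human, checker):
    """Compare checker decisions against human ground truth.

    human:   list of "accept" / "reject" (assumed ground truth)
    checker: list of "accept" / "reject" / "manual_review"
    """
    if len(human) != len(checker):
        raise ValueError("human and checker must have the same length")

    # Confusion counts keyed by (human_label, checker_label).
    pairs = Counter(zip(human, checker))
    total = len(human)
    human_accept = sum(n for (h, _), n in pairs.items() if h == "accept")
    human_reject = sum(n for (h, _), n in pairs.items() if h == "reject")
    manual = sum(n for (_, c), n in pairs.items() if c == "manual_review")

    return {
        # Bad work that the checker let through, out of all bad work.
        "false_accept_rate": (
            pairs[("reject", "accept")] / human_reject if human_reject else 0.0
        ),
        # Good work that the checker blocked, out of all good work.
        "false_reject_rate": (
            pairs[("accept", "reject")] / human_accept if human_accept else 0.0
        ),
        # Share of all tasks the checker could not decide on its own.
        "manual_review_share": manual / total if total else 0.0,
    }


human = ["accept", "accept", "reject", "reject", "accept", "reject"]
checker = ["accept", "reject", "accept", "reject", "manual_review", "manual_review"]
metrics = decision_metrics(human, checker)
```

Other normalizations are possible, for example dividing false accepts by the number of auto-accepted tasks; whichever is chosen should be stated next to the number.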
At the error-list level, treat human and model outputs as sets of structured errors. Match errors by type plus location/evidence with a tolerance rule. Then compute precision, recall and F1 for error detection. If location is unavailable, evaluate type-level recall separately from exact-location recall.
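The matching step could look like the sketch below. It assumes each error carries a type and an integer line number as its location, and that two errors match when the types are equal and the lines differ by at most `tolerance`. The `Error` class, the greedy one-to-one matching and the default tolerance are illustrative assumptions; a real checker may need span overlap or evidence comparison instead.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Error:
    type: str
    line: int  # hypothetical location field


def count_matches(human, model, tolerance=1):
    """Greedy one-to-one matching: same type, line within tolerance."""
    unmatched = list(human)
    matched = 0
    for m in model:
        candidates = [
            h for h in unmatched
            if h.type == m.type and abs(h.line - m.line) <= tolerance
        ]
        if candidates:
            # Take the closest human error and remove it so it cannot match twice.
            best = min(candidates, key=lambda h: abs(h.line - m.line))
            unmatched.remove(best)
            matched += 1
    return matched


def precision_recall_f1(human, model, tolerance=1):
    tp = count_matches(human, model, tolerance)
    precision = tp / len(model) if model else 0.0
    recall = tp / len(human) if human else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def type_level_recall(human, model):
    """Recall when location is ignored: multiset overlap of error types."""
    overlap = Counter(h.type for h in human) & Counter(m.type for m in model)
    return sum(overlap.values()) / len(human) if human else 0.0


human = [Error("typo", 10), Error("logic", 20), Error("style", 30)]
model = [Error("typo", 11), Error("logic", 25), Error("style", 30), Error("typo", 40)]
precision, recall, f1 = precision_recall_f1(human, model, tolerance=1)
loose_recall = type_level_recall(human, model)
```

In this example the "logic" error is found with the right type but five lines away, so it counts toward type-level recall and not toward exact-location recall. Reporting both numbers makes that gap visible.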
Add product metrics: customer complaints after auto-accept, reviewer override rate, time saved, cost per task, latency and drift by task category. A checker that finds many errors but sends everything to manual review may be accurate but not useful.
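Two of these product metrics, broken down by task category, can be sketched as below. The record fields (`category`, `checker`, `final`) and the category names are hypothetical, and the sketch assumes a reviewer's final decision is available for every task, which in practice usually holds only for an audited sample.

```python
from collections import defaultdict


def product_metrics_by_category(records):
    """Per-category automation rate and reviewer override rate.

    Each record is a dict with:
      category: task category (hypothetical field)
      checker:  "accept" / "reject" / "manual_review"
      final:    reviewer's final "accept" / "reject"
    """
    stats = defaultdict(lambda: {"total": 0, "auto": 0, "overridden": 0})
    for r in records:
        s = stats[r["category"]]
        s["total"] += 1
        if r["checker"] != "manual_review":
            s["auto"] += 1  # checker decided without a human
            if r["final"] != r["checker"]:
                s["overridden"] += 1  # reviewer reversed the decision

    return {
        cat: {
            "automation_rate": s["auto"] / s["total"],
            "override_rate": s["overridden"] / s["auto"] if s["auto"] else 0.0,
        }
        for cat, s in stats.items()
    }


records = [
    {"category": "translation", "checker": "accept", "final": "accept"},
    {"category": "translation", "checker": "accept", "final": "reject"},
    {"category": "translation", "checker": "manual_review", "final": "accept"},
    {"category": "translation", "checker": "reject", "final": "reject"},
    {"category": "coding", "checker": "manual_review", "final": "reject"},
    {"category": "coding", "checker": "manual_review", "final": "accept"},
]
by_category = product_metrics_by_category(records)
```

Here the "coding" category has an automation rate of zero: the checker never makes a wrong auto-decision there, yet it saves no reviewer time, which is the "accurate but not useful" case.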
Common mistakes
- Reporting only overall accuracy and ignoring the cost of false accepts.
- Comparing free-text explanations by exact string equality.
- Forgetting the manual-review budget constraint.