Question · Hard · model-evaluation · Metrics question at screening · Focus / Teramind

Metrics question

A retail video analytics model should flag suspicious behavior, but humans do not fully agree on what “suspicious” means. How would you define success and evaluate whether the system is doing a good job?

Answer it yourself

First formulate your answer as you would in an interview, then read the breakdown and assess yourself.

Short answer

First turn “suspicious” into operational categories and severity levels, measure human agreement, build a labeled review set with adjudication, then optimize risk-calibrated precision/recall and downstream business outcomes.

Full breakdown

If humans disagree, the first task is not model selection; it is label design. Define categories of suspicious behavior, severity, required action and non-goals. Measure inter-annotator agreement and keep a third-party or senior-review adjudication process for hard examples.

Evaluation should mix ML metrics and product metrics. Offline, use a stratified video set across stores, camera positions, time of day and traffic level. Track precision, recall, false alarms per hour, missed high-severity incidents, calibration and performance by subgroup or environment. If labels remain ambiguous, report soft labels or agreement-weighted metrics rather than pretending there is one perfect ground truth.

Online, measure analyst workload, alert acceptance rate, time to incident review, customer/store outcomes and appeal rate.

Thresholds should be risk-based: high-confidence severe events can alert immediately; uncertain low-severity events can go to passive review or sampling. Monitoring must watch drift by store layout, seasonality, camera changes and policy changes.
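The offline metrics named above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Clip` fields, the two severity levels and the `vote_fraction` soft label (the share of annotators who marked a clip suspicious) are all assumptions introduced for the example, and the "soft precision" here is just one simple way to weight alerts by annotator agreement.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    alerted: bool         # the model raised an alert on this clip
    incident: bool        # adjudicated label: a real incident occurred
    severity: str         # hypothetical levels: "low" or "high"
    vote_fraction: float  # hypothetical soft label: share of annotators who flagged it


def offline_report(clips, hours_of_video):
    """Compute hard-label and agreement-weighted metrics for a labeled review set."""
    if hours_of_video <= 0:
        raise ValueError("hours_of_video must be positive")
    tp = sum(c.alerted and c.incident for c in clips)
    fp = sum(c.alerted and not c.incident for c in clips)
    fn = sum(not c.alerted and c.incident for c in clips)
    missed_high = sum(
        not c.alerted and c.incident and c.severity == "high" for c in clips
    )
    alerts = [c for c in clips if c.alerted]
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarms_per_hour": fp / hours_of_video,
        "missed_high_severity": missed_high,
        # Each alert earns credit equal to the share of annotators who agreed with it.
        "soft_precision": (
            sum(c.vote_fraction for c in alerts) / len(alerts) if alerts else 0.0
        ),
    }


# Toy review set; the values are made up purely for illustration.
clips = [
    Clip(True, True, "high", 1.0),
    Clip(True, False, "low", 0.25),
    Clip(False, True, "high", 0.75),
    Clip(True, True, "low", 0.75),
    Clip(False, False, "low", 0.0),
    Clip(True, False, "low", 0.0),
]
report = offline_report(clips, hours_of_video=4.0)
print(report)
```

In practice the same report would be computed per stratum (store, camera position, time of day, traffic level) so that a good aggregate number cannot hide a weak environment.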

Theory

Ambiguous-label systems need operational definitions, agreement measurement and risk-calibrated decision thresholds before model metrics become meaningful.
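One common way to measure agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration with made-up labels; the choice of kappa (rather than, say, a multi-annotator statistic) and the label names are assumptions for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same clips."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of clips where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on the same six clips.
annotator_1 = ["suspicious", "normal", "normal", "suspicious", "normal", "normal"]
annotator_2 = ["suspicious", "normal", "suspicious", "suspicious", "normal", "normal"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.667
```

Here raw agreement is 5 of 6 clips, but half of that is expected by chance given the label frequencies, so kappa comes out at about 0.67. A low kappa on a category is a signal to tighten its definition before tuning any model against it.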

Common mistakes

Optimizing accuracy on a noisy label without defining the action an alert should trigger.
Ignoring annotator disagreement.
Using one global threshold for all severities and stores.
Forgetting false alarms per hour and reviewer workload.
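As an alternative to a single global threshold, routing can depend on severity. The sketch below is only an illustration: the threshold values, the severity levels and the three route names are hypothetical, and real values would be tuned on validation data (and possibly overridden per store) against reviewer capacity.

```python
# Hypothetical per-severity alert thresholds. A missed severe incident is assumed
# to cost more than a false alarm, so severe events alert at a lower score.
ALERT_THRESHOLDS = {"high": 0.60, "low": 0.90}
# Hypothetical floor below which an event is not queued for review at all.
REVIEW_THRESHOLD = 0.40


def route_event(severity, score):
    """Map a predicted severity and model confidence score to an action."""
    if severity not in ALERT_THRESHOLDS:
        raise ValueError(f"unknown severity: {severity!r}")
    if score >= ALERT_THRESHOLDS[severity]:
        return "alert_now"        # interrupt an analyst immediately
    if score >= REVIEW_THRESHOLD:
        return "passive_review"   # queue for later review or sampling
    return "ignore"


print(route_event("high", 0.70))  # alert_now
print(route_event("low", 0.70))   # passive_review
print(route_event("low", 0.10))   # ignore
```

The same score of 0.70 leads to different actions depending on severity, which is the point: the threshold encodes the cost of a miss versus the cost of an interruption.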

How to answer in an interview

Start by defining the label and action.
Bring up inter-annotator agreement and adjudication.