Production ML question
A new perception detector improves some offline metrics but degrades others. How do you decide whether to ship it to production?
Try answering it yourself
First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.
Short answer
Do not average metrics blindly. Map them to product and safety costs, inspect critical slices, require guardrails, and roll out only if the target operating point improves without violating constraints.
Full walkthrough
For an autonomous-driving detector, the decision is not simply "two metrics up, two down". First define what each metric means for the product and safety case. A small recall regression on pedestrians can be unacceptable even if mAP improves, while extra false positives might be tolerable if the planner can safely slow down.
The practical flow is to choose primary metrics and guardrails, inspect per-class and per-scenario slices, evaluate at the intended threshold or operating point, and compare against regression budgets. Important slices can include pedestrians, cyclists, night, rain, occlusion, distance buckets and rare critical scenarios.
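As a rough illustration of this flow, the sketch below compares a candidate against a baseline on a primary metric and checks per-slice regression budgets. The `Guardrail` class, the `check_guardrails` helper, the slice names, and all numbers are hypothetical and only show the shape of such a gate, assuming every metric was measured at the intended operating threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    slice_name: str
    metric: str
    max_regression: float  # allowed absolute drop of candidate vs. baseline


def check_guardrails(baseline, candidate, guardrails):
    """Return a list of violation messages; an empty list means all guardrails pass."""
    violations = []
    for g in guardrails:
        key = (g.slice_name, g.metric)
        drop = baseline[key] - candidate[key]
        if drop > g.max_regression:
            violations.append(
                f"{g.slice_name}/{g.metric}: dropped {drop:.3f}, "
                f"budget {g.max_regression:.3f}"
            )
    return violations


# Illustrative numbers only (hypothetical slices and values).
baseline = {
    ("all", "mAP"): 0.610,
    ("pedestrian_night", "recall"): 0.930,
    ("cyclist_occluded", "recall"): 0.880,
}
candidate = {
    ("all", "mAP"): 0.640,
    ("pedestrian_night", "recall"): 0.900,
    ("cyclist_occluded", "recall"): 0.885,
}
guardrails = [
    Guardrail("pedestrian_night", "recall", max_regression=0.005),
    Guardrail("cyclist_occluded", "recall", max_regression=0.005),
]

violations = check_guardrails(baseline, candidate, guardrails)
primary_improved = candidate[("all", "mAP")] > baseline[("all", "mAP")]

# The candidate moves on to rollout stages only if the primary metric
# improves AND no guardrail slice exceeds its regression budget.
ready_for_rollout = primary_improved and not violations
print(violations, ready_for_rollout)
```

Here the aggregate mAP improves, but the night-time pedestrian recall drop exceeds its budget, so the gate blocks the candidate despite the better headline number.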
Only after offline checks pass should the team consider shadow mode, canary, simulation replay, or staged rollout. A stronger answer explicitly ties metrics to downstream system behavior and business/safety constraints rather than treating every offline metric equally.
Theory
Production ML decisions require a utility or risk model, not just aggregate metric comparison.
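One minimal way to make this concrete is an expected-cost comparison, where each error type is weighted by how much it hurts the product. The function name `expected_cost`, the error rates, and the cost weights below are made-up assumptions for illustration; in practice the weights would come from the safety case and downstream planner behavior.

```python
def expected_cost(misses, false_alarms, cost_miss, cost_false_alarm):
    """Cost-weighted error total; all arguments are in caller-chosen units."""
    return misses * cost_miss + false_alarms * cost_false_alarm


# Hypothetical errors per 1,000 frames at the intended operating point.
baseline_errors = {"misses": 2.0, "false_alarms": 10.0}
candidate_errors = {"misses": 3.0, "false_alarms": 4.0}

# Assumed relative costs: a missed pedestrian is far worse than a false alarm.
COST_MISS = 100.0
COST_FALSE_ALARM = 1.0

baseline_cost = expected_cost(
    baseline_errors["misses"], baseline_errors["false_alarms"],
    COST_MISS, COST_FALSE_ALARM,
)
candidate_cost = expected_cost(
    candidate_errors["misses"], candidate_errors["false_alarms"],
    COST_MISS, COST_FALSE_ALARM,
)

# Unweighted error counts, i.e. what a blind aggregate comparison would see.
baseline_total_errors = sum(baseline_errors.values())
candidate_total_errors = sum(candidate_errors.values())

print(baseline_total_errors, candidate_total_errors)  # 12.0 7.0
print(baseline_cost, candidate_cost)                  # 210.0 304.0
```

With these assumed weights the candidate makes fewer errors in total yet carries a higher expected cost, which is exactly the case an unweighted aggregate would hide.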
Common mistakes
- Averaging unrelated metrics into one score with no justification.
- Ignoring safety-critical slices.
- Shipping directly on an offline improvement without rollout controls.
How to answer in an interview
- Say which error is worse for this product.
- Mention threshold choice and per-slice validation.
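To illustrate the threshold point, here is a small sketch that picks the highest score threshold still meeting a recall constraint and reports the precision obtained there. The helper `pick_threshold` and the toy scores and labels are hypothetical; it assumes binary labels with at least one positive, and the same selection could be repeated per slice.

```python
def pick_threshold(scores, labels, min_recall):
    """Return (threshold, recall, precision) for the highest threshold whose
    recall is at least min_recall, or None if no threshold qualifies."""
    total_pos = sum(labels)
    best = None
    # Ascending order: recall can only fall as the threshold rises,
    # so the last qualifying threshold is the highest one.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        recall = tp / total_pos
        if recall >= min_recall:
            best = (t, recall, tp / (tp + fp))
    return best


# Toy detections: confidence scores and whether each matched a real object.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

best = pick_threshold(scores, labels, min_recall=0.8)
print(best)  # (0.6, 0.8, 0.8)
```

Comparing two detectors at thresholds chosen this way keeps the comparison at the operating point the product will actually run at, rather than at whatever default threshold each model happens to ship with.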