Latency, Cost and Observability
Production metrics: queue depth, GPU utilization, p95/p99 latency, error modes, cost per request, regression detection and rollout monitoring.
What the candidate should be able to do
- Define SLOs for ML systems beyond average latency.
- Monitor GPU utilization, queue depth, tokens/sec, cost/request and model-quality regressions.
- Design canary/shadow rollout for ML serving changes.
- Connect technical metrics with product and business constraints.
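To illustrate why an SLO should look past average latency, here is a minimal sketch that summarizes a batch of request records into mean, p95, p99, error rate and cost per request. The `RequestRecord` fields, the nearest-rank percentile method and the flat `gpu_cost_per_hour` pricing are all assumptions made for this example, not a prescribed schema or billing model.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical log schema, assumed for this sketch.
    latency_ms: float
    gpu_seconds: float
    ok: bool


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(records, gpu_cost_per_hour):
    """Aggregate latency, error rate and cost for one reporting window."""
    latencies = [r.latency_ms for r in records]
    total_gpu_seconds = sum(r.gpu_seconds for r in records)
    n = len(records)
    return {
        "mean_ms": sum(latencies) / n,
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": sum(not r.ok for r in records) / n,
        # Assumes cost scales linearly with GPU time at a flat hourly price.
        "cost_per_request": total_gpu_seconds / 3600 * gpu_cost_per_hour / n,
    }


# 98 fast requests and 2 slow failures: the mean hides the tail.
records = [RequestRecord(100.0, 0.5, True) for _ in range(98)]
records += [RequestRecord(2000.0, 0.5, False) for _ in range(2)]
summary = summarize(records, gpu_cost_per_hour=3.6)
```

In this made-up window the mean and p95 both look healthy while p99 is twenty times higher, which is the kind of gap a tail-latency SLO is meant to expose.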
What gets asked in interviews
- What metrics prove optimization worked?
- How would you catch silent quality regressions?
- How do online experiments differ for ranking vs generation?
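One possible answer to the silent-regression question is to score the baseline and the candidate on the same fixed evaluation set and compare per-example results. The sketch below assumes the scores already exist (from human labels, a task metric or an automated judge) and that higher is better; the function name `quality_regression` and the `max_drop` threshold are hypothetical choices for illustration.

```python
from statistics import mean


def quality_regression(baseline_scores, candidate_scores, max_drop=0.02):
    """Compare aligned per-example quality scores (higher is better)."""
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must be non-empty and aligned")
    deltas = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    mean_delta = mean(deltas)
    return {
        "mean_delta": mean_delta,
        # Share of examples where the candidate scored lower than baseline.
        "worse_share": sum(d < 0 for d in deltas) / len(deltas),
        # Flag only drops larger than the assumed tolerance.
        "regressed": mean_delta < -max_drop,
    }


report = quality_regression(
    baseline_scores=[0.9, 0.8, 0.7, 0.9],
    candidate_scores=[0.9, 0.7, 0.6, 0.9],
)
```

A mean-delta threshold is deliberately crude. With real traffic you would likely want a confidence interval or a per-segment breakdown before blocking a rollout.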
Practical task
Build an observability spec for an ML inference service: metrics, alerts, dashboards, rollout gates and cost attribution.
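As a starting point for the rollout-gates part of such a spec, the sketch below compares canary metrics against the baseline and reports which gates fail. It assumes every gated metric is "lower is better" (latency, error rate, cost per request, a quality failure rate); the `Gate` type, metric names and thresholds are hypothetical and would need adapting to the actual service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    # Allowed relative increase over baseline, e.g. 0.10 means up to 10% worse.
    max_relative_increase: float


def evaluate_gates(baseline, canary, gates):
    """Return a list of human-readable gate failures (empty means promote)."""
    failures = []
    for gate in gates:
        base = baseline[gate.metric]
        cand = canary[gate.metric]
        # With a zero baseline the limit is zero, so any increase fails.
        limit = base * (1 + gate.max_relative_increase)
        if cand > limit:
            failures.append(
                f"{gate.metric}: canary {cand} exceeds limit {limit:.4g}"
            )
    return failures


gates = [
    Gate("p99_ms", 0.10),
    Gate("error_rate", 0.0),
    Gate("cost_per_request", 0.10),
]
baseline = {"p99_ms": 800.0, "error_rate": 0.01, "cost_per_request": 0.0020}
canary = {"p99_ms": 900.0, "error_rate": 0.01, "cost_per_request": 0.0021}
failures = evaluate_gates(baseline, canary, gates)
```

Here only the p99 gate trips, so the canary would be held back even though error rate and cost are within bounds. Keeping gates as data rather than code makes it easier to review thresholds alongside the dashboards and alerts in the same spec.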
Source-grounded rule
Use industry posts as patterns, not universal blueprints; adapt monitoring to the task and the traffic shape.