LLM Evaluation, Reliability and Cost
Evaluation harnesses, task-specific evals, regression checks, cost-per-token reasoning, benchmark caveats and production monitoring.
What the candidate should be able to do
- Separate leaderboard quality from product utility and reliability.
- Choose evals for capability, safety, latency and cost.
- Estimate serving cost from token mix, GPU price, utilization and cache behavior.
- Design rollout gates: offline evals, human review, canary and monitoring.
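The rollout-gate idea in the last bullet can be sketched as a short pipeline: an offline-eval gate that blocks regressions against a baseline, a canary gate that checks error rate and latency SLOs, and a final go/no-go decision. This is a minimal illustration; every threshold, stage name, and metric value below is an assumed placeholder, not a fixed standard.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    stage: str
    passed: bool
    detail: str


def offline_eval_gate(candidate_score: float, baseline_score: float,
                      max_regression: float = 0.01) -> GateResult:
    """Block rollout if the candidate regresses more than max_regression vs. baseline."""
    delta = candidate_score - baseline_score
    return GateResult("offline_eval", delta >= -max_regression, f"delta={delta:+.3f}")


def canary_gate(error_rate: float, p95_latency_ms: float,
                max_error_rate: float = 0.005,
                latency_slo_ms: float = 800.0) -> GateResult:
    """Check canary traffic against assumed error-rate and latency SLOs."""
    ok = error_rate <= max_error_rate and p95_latency_ms <= latency_slo_ms
    return GateResult("canary", ok, f"error_rate={error_rate:.4f}, p95={p95_latency_ms:.0f}ms")


def rollout_decision(gates: list[GateResult]) -> bool:
    """Go only if every gate passed; any single failure is a no-go."""
    return all(g.passed for g in gates)


# Placeholder numbers for illustration only.
gates = [
    offline_eval_gate(candidate_score=0.842, baseline_score=0.838),
    canary_gate(error_rate=0.002, p95_latency_ms=640.0),
]
print(rollout_decision(gates))  # → True: both gates pass with these inputs
```

Human review sits between the offline gate and the canary in practice; it is omitted here because it is a process step, not a computable check.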
What interviewers ask
- Why are leaderboards insufficient?
- How do you evaluate a model change before rollout?
- What metrics decide whether serving optimization worked?
Practical task
Create an eval-and-cost scorecard with quality evals, latency SLOs, a throughput target, a token budget, GPU assumptions, and go/no-go criteria.
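One way to start the task above is to express the scorecard as plain data plus a go/no-go function. Every figure below (accuracy targets, SLOs, token mix, GPU price) is an assumed placeholder to be replaced with your own measurements.

```python
# Scorecard sketch: targets on one side, a decision rule on the other.
scorecard = {
    "quality_evals": {"task_accuracy": 0.85, "safety_pass_rate": 0.99},
    "latency_slo_ms": {"p50": 300, "p95": 900},
    "throughput_target_rps": 50,
    "token_budget": {"avg_prompt_tokens": 600, "avg_completion_tokens": 250},
    "gpu_assumptions": {"hourly_usd": 2.50, "tokens_per_sec": 2500, "utilization": 0.6},
}


def go_no_go(measured: dict, card: dict) -> bool:
    """Go only if measured quality meets every target and latency stays within SLO."""
    quality_ok = all(measured["quality"][k] >= target
                     for k, target in card["quality_evals"].items())
    latency_ok = all(measured["latency_ms"][k] <= slo
                     for k, slo in card["latency_slo_ms"].items())
    return quality_ok and latency_ok


# Hypothetical measurements from an offline eval + canary run.
measured = {
    "quality": {"task_accuracy": 0.87, "safety_pass_rate": 0.995},
    "latency_ms": {"p50": 280, "p95": 850},
}
print(go_no_go(measured, scorecard))  # → True under these placeholder numbers
```

Keeping the criteria in data rather than in code makes the same gate reusable across model changes and easy to diff in review.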
Source-grounded rule
Cost examples must be recalculated from current GPU prices and traffic assumptions; do not hard-code stale numbers.
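A minimal sketch of how to keep that rule honest: derive $/1M tokens from inputs you re-measure (GPU hourly price, token throughput, utilization) instead of quoting a stale number. The inputs in the example call are illustrative only.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Effective $/1M tokens: hourly GPU price divided by tokens actually served per hour."""
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / effective_tokens_per_hour * 1_000_000


# Example inputs (placeholders): $2.50/hr GPU, 2500 tok/s peak, 60% utilization.
print(round(cost_per_million_tokens(2.50, 2500.0, 0.6), 3))  # → 0.463
```

Note that utilization and cache behavior dominate the result: halving utilization doubles the effective cost even though the GPU price and peak throughput are unchanged.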