ML System Design
You collected months of human-reviewer decisions for task outputs. How could you use this data to improve the automatic checker?
First formulate your answer as you would in an interview, then read the full breakdown and grade yourself.
Short answer
Build supervised examples from the task spec, the worker output and the reviewer's decision and errors; clean and deduplicate them; split by time, task type or customer; then train classifiers or rerankers, or fine-tune an LLM for structured error detection.
Full breakdown
Reviewer data can become supervised training data: input is the task spec, worker output and evidence; target is the reviewer decision plus structured error list. Before training, normalize taxonomies, remove low-quality/disputed labels, deduplicate near-identical tasks and protect customer-sensitive fields.
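The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not a reference pipeline: the `Review` fields, the `TAXONOMY` mapping and the fingerprint rule (lowercase, digits collapsed, whitespace collapsed) are all assumptions made for the example, and a real system would likely need a fuzzier near-duplicate check.

```python
import hashlib
import re
from dataclasses import dataclass

# Hypothetical mapping from free-form reviewer tags to a canonical taxonomy.
TAXONOMY = {
    "typo": "spelling",
    "misspelling": "spelling",
    "wrong format": "format",
    "formatting": "format",
}


@dataclass(frozen=True)
class Review:
    task_spec: str
    output: str
    decision: str             # "accept" or "reject"
    errors: tuple             # raw reviewer error tags
    evidence: str = ""        # supporting material shown to the reviewer
    disputed: bool = False    # e.g. reviewers disagreed or the call was overturned
    customer_id: str = ""     # sensitive: deliberately not copied into examples


def normalize_errors(tags):
    """Map raw tags onto the canonical taxonomy; unknown tags become 'other'."""
    return sorted({TAXONOMY.get(t.strip().lower(), "other") for t in tags})


def fingerprint(review):
    """Hash the spec and output with case, digits and whitespace normalized."""
    text = f"{review.task_spec}\n{review.output}".lower()
    text = re.sub(r"\d+", "0", text)
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_examples(reviews):
    """Drop disputed labels, keep the first of each near-duplicate group."""
    seen = set()
    examples = []
    for r in reviews:
        if r.disputed:
            continue
        key = fingerprint(r)
        if key in seen:
            continue
        seen.add(key)
        examples.append({
            "input": {"task_spec": r.task_spec, "output": r.output,
                      "evidence": r.evidence},
            "target": {"decision": r.decision,
                       "errors": normalize_errors(r.errors)},
        })
    return examples
```

Keeping the first item of each duplicate group is the simplest policy; if duplicates carry conflicting decisions, it may be safer to treat the whole group as disputed and drop it.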
Start with simpler models where possible: error-type classifiers, risk scoring models or retrieval of similar past failures. For LLMs, supervised fine-tuning can teach the desired output schema and error wording, while preference data can rank better explanations or reduce false accepts.
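As one example of a simple starting point, retrieval of similar past failures can be prototyped without any training. The sketch below uses token-set Jaccard similarity, which is an assumption chosen for brevity; the function names, the `min_score` threshold and the record layout are hypothetical, and an embedding-based index would be a natural replacement later.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_failures(output, past_failures, k=3, min_score=0.2):
    """Return up to k (score, failure) pairs, most similar first."""
    query = tokens(output)
    scored = [(jaccard(query, tokens(f["output"])), f) for f in past_failures]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The retrieved failures and their reviewer error lists can be shown to a human reviewer or fed as features to a downstream risk model.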
Validation must be time-aware and category-aware. Split by time, customer or task type to avoid memorizing templates. Track false accept rate, false reject rate, manual-review load and per-category degradation. Keep humans in the loop for uncertain or high-risk outputs.
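A time-based split and the headline metrics might look like the sketch below. The metric definitions are assumptions, since the text does not fix them: false accept rate is taken as the share of reviewer-rejected outputs that the checker auto-accepted, false reject rate as the share of reviewer-accepted outputs it auto-rejected, and manual-review load as the share routed to humans. The `reviewed_at` field and both thresholds are hypothetical.

```python
from datetime import date


def time_split(examples, cutoff):
    """Train on reviews before the cutoff, evaluate on reviews from it onward."""
    train = [e for e in examples if e["reviewed_at"] < cutoff]
    test = [e for e in examples if e["reviewed_at"] >= cutoff]
    return train, test


def route(p_bad, accept_below=0.2, reject_above=0.8):
    """Auto-decide only when the model is confident; otherwise send to a human."""
    if p_bad < accept_below:
        return "accept"
    if p_bad > reject_above:
        return "reject"
    return "manual"


def evaluate(rows, accept_below=0.2, reject_above=0.8):
    """rows: iterable of (p_bad, human_decision) pairs from the held-out period."""
    routed = [(route(p, accept_below, reject_above), human) for p, human in rows]
    human_rejects = sum(1 for _, h in routed if h == "reject")
    human_accepts = sum(1 for _, h in routed if h == "accept")
    false_accepts = sum(1 for r, h in routed if r == "accept" and h == "reject")
    false_rejects = sum(1 for r, h in routed if r == "reject" and h == "accept")
    manual = sum(1 for r, _ in routed if r == "manual")
    return {
        "false_accept_rate": false_accepts / human_rejects if human_rejects else 0.0,
        "false_reject_rate": false_rejects / human_accepts if human_accepts else 0.0,
        "manual_review_load": manual / len(routed) if routed else 0.0,
    }
```

Running `evaluate` separately per task type or customer on the held-out period gives the per-category view; widening the gap between the two thresholds trades a higher manual-review load for fewer automatic mistakes.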