Interview prep: Toloka AI: ML System Design

Case 1 (15 min)

ML System Design

Design an automatic system that checks whether a human/agent task result is good enough before delivery to a customer. How do you frame the ML problem?

Answer without a hint

First say your answer out loud or outline it in bullet points.

Write a draft

Formulas, a solution plan, risks and examples.

Compare with the breakdown

Open the breakdown only after your own attempt.

Short answer

Frame it as quality-control risk estimation plus structured error detection: given task spec, worker output and available evidence, decide accept/reject/manual-review and produce actionable error reasons.

Detailed breakdown

Start with inputs: task description, customer requirements, worker or agent output, attachments, logs and possibly historical reviewer decisions. Outputs should not be only a scalar score. A practical checker returns accept/reject/manual-review plus structured error reasons and evidence.

Define the decision policy around costs. False accept means a bad result reaches the customer; false reject means extra cost, delay and possibly unfair worker feedback. Manual review is a constrained fallback, so the model should optimize quality under a review-budget constraint.
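
The cost-based policy above can be sketched as an expected-cost comparison with a cap on manual reviews. This is a minimal illustration, not a production policy: the cost values, the `p_bad` risk estimate and the rule for spending the review budget (review the cases where it saves the most) are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Costs:
    false_accept: float  # assumed cost of a bad result reaching the customer
    false_reject: float  # assumed cost of rejecting good work
    review: float        # assumed cost of one manual review

def best_auto(p_bad: float, costs: Costs) -> tuple[str, float]:
    """Cheapest automatic action and its expected cost."""
    accept_cost = p_bad * costs.false_accept
    reject_cost = (1.0 - p_bad) * costs.false_reject
    return ("accept", accept_cost) if accept_cost <= reject_cost else ("reject", reject_cost)

def route(p_bad_by_task: dict[str, float], costs: Costs, review_budget: int) -> dict[str, str]:
    """Assign accept/reject/manual_review without exceeding the review budget."""
    decisions = {}
    review_candidates = []
    for task_id, p_bad in p_bad_by_task.items():
        action, auto_cost = best_auto(p_bad, costs)
        decisions[task_id] = action
        saving = auto_cost - costs.review  # how much a review would save
        if saving > 0:
            review_candidates.append((saving, task_id))
    # Spend the limited review slots where they save the most expected cost.
    review_candidates.sort(reverse=True)
    for _, task_id in review_candidates[:review_budget]:
        decisions[task_id] = "manual_review"
    return decisions
```

With a budget of one review, a borderline task that would otherwise go to a human falls back to its cheapest automatic action, which makes the capacity constraint explicit instead of implicit.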

A reasonable first version combines rules and LLM-based checking: validate required files/fields, check format, compare output to task requirements, flag fraud or hallucination signals, and send uncertain/high-risk cases to humans. Later versions can train classifiers or fine-tune models from reviewer labels.
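
A first version along these lines might look like the sketch below: cheap deterministic rules run first, and only outputs that pass them reach the model. The field names, the thresholds and the `llm_risk` callable are hypothetical placeholders; a real checker would plug in its own rules and model call.

```python
from typing import Callable

REQUIRED_FIELDS = ("title", "summary", "source_url")  # hypothetical required fields

def rule_checks(output: dict) -> list[str]:
    """Deterministic checks that run before any model call."""
    reasons = []
    for name in REQUIRED_FIELDS:
        if not str(output.get(name, "")).strip():
            reasons.append(f"missing_field:{name}")
    url = str(output.get("source_url", ""))
    if url and not url.startswith(("http://", "https://")):
        reasons.append("format_violation:source_url")
    return reasons

def check_task(spec: str, output: dict,
               llm_risk: Callable[[str, dict], float],
               low: float = 0.2, high: float = 0.8) -> tuple[str, list[str]]:
    """Return (decision, reason codes). Thresholds are illustrative assumptions."""
    reasons = rule_checks(output)
    if reasons:
        return "reject", reasons
    risk = llm_risk(spec, output)  # assumed to return a risk score in [0, 1]
    if risk >= high:
        return "reject", ["llm_flagged_high_risk"]
    if risk <= low:
        return "accept", []
    # Anything in between is uncertain and goes to a human.
    return "manual_review", ["llm_uncertain"]
```

Running the rules first keeps obvious failures out of the model entirely, which saves cost and makes the reject reasons easy to explain to the worker.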

Common mistakes

Make the output a vague quality score with no reason codes.
Ignore false accept versus false reject cost asymmetry.
Forget manual-review capacity as a hard constraint.

Case 2 (10 min)

ML System Design

What should the output schema of an automatic task checker look like if humans also produce lists of found errors?

Short answer

Use a structured list of error objects with type, severity, location/evidence, explanation and suggested action, plus an overall decision. This makes human/model comparison and downstream operations possible.

Detailed breakdown

A checker should produce structured evidence, not just free-form text. A useful schema is: overall_decision, confidence, and errors[] where each error has type, severity, affected artifact/location, evidence quote or pointer, explanation and suggested fix.

The taxonomy should be stable enough for metrics: missing file, inaccessible link, format violation, factual mismatch, hallucination, instruction mismatch, fraud/spam and low-quality output are examples. Free text can remain as explanation, but type and location should be machine-readable.
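
One way to write this schema down is with dataclasses and an enum for the taxonomy. The field names follow the description above; the severity scale and the exact enum values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum

class ErrorType(str, Enum):
    MISSING_FILE = "missing_file"
    INACCESSIBLE_LINK = "inaccessible_link"
    FORMAT_VIOLATION = "format_violation"
    FACTUAL_MISMATCH = "factual_mismatch"
    HALLUCINATION = "hallucination"
    INSTRUCTION_MISMATCH = "instruction_mismatch"
    FRAUD_SPAM = "fraud_spam"
    LOW_QUALITY = "low_quality"

@dataclass
class FoundError:
    type: ErrorType
    severity: str            # assumed scale: "low" | "medium" | "high"
    location: str            # affected artifact or position, machine-readable
    evidence: str            # quote or pointer supporting the finding
    explanation: str         # free text for humans
    suggested_fix: str = ""

@dataclass
class CheckResult:
    overall_decision: str    # "accept" | "reject" | "manual_review"
    confidence: float
    errors: list[FoundError] = field(default_factory=list)

    def to_json(self) -> str:
        # ErrorType subclasses str, so json serializes it as its value.
        return json.dumps(asdict(self))
```

Because human reviewers can fill in the same structure, both sides produce comparable objects rather than free text.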

This schema supports evaluation against human reviewers. You can compare sets of error objects by type and location, count false accepts/rejects, inspect disagreements and route specific error types to specialized follow-up checks.

Case 3 (15 min)

ML System Design

How would you design an LLM-agent loop that checks a task output using tools such as file reading, web access or document inspection?

Short answer

Use a planner-checker loop: parse task/spec/output, generate hypotheses to verify, call constrained tools, accumulate evidence, then produce a structured verdict with uncertainty and escalation.

Detailed breakdown

The first step is context packaging: task spec, worker output, attachments and customer requirements must be normalized into a representation the checker can use. If artifacts are large, retrieve or summarize relevant parts instead of dumping everything into one prompt.

Then run an explicit loop. The model proposes checks such as file exists, table value matches requirement, link accessible, claim supported, image contains requested object. Tools execute those checks with constrained permissions. Results are appended as evidence, and the model either asks for another check or decides.

Production guardrails matter: limit tool calls, make tool outputs deterministic, log every check, keep a timeout budget and escalate uncertain cases. The final answer should be a structured verdict grounded in tool evidence, not just the model’s unsupported opinion.
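
The loop and its guardrails can be sketched in a few lines. Here `propose` stands in for the model call and `tools` for an allow-list of check functions; both are hypothetical interfaces, and the step format (`tool`/`arg` or `verdict`) is an assumption made for this example.

```python
import time
from typing import Callable

def run_checker(propose: Callable[[list[dict]], dict],
                tools: dict[str, Callable[[str], str]],
                max_calls: int = 5,
                timeout_s: float = 30.0) -> dict:
    """Alternate propose -> tool -> evidence until a verdict or the budget ends."""
    evidence: list[dict] = []  # doubles as the audit log of every check
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        step = propose(evidence)  # model sees all evidence gathered so far
        if "verdict" in step:
            return {"decision": step["verdict"], "evidence": evidence}
        if len(evidence) >= max_calls:
            break  # tool-call budget spent without a verdict
        tool = tools.get(step["tool"])
        # Only allow-listed tools run; anything else is recorded as a failed check.
        result = tool(step["arg"]) if tool else "error: tool not allowed"
        evidence.append({"tool": step["tool"], "arg": step["arg"], "result": result})
    # Out of budget or time: escalate instead of guessing.
    return {"decision": "manual_review", "reason": "budget_exhausted", "evidence": evidence}
```

The returned evidence list is what a reviewer would audit later, and the fallback to manual review means an agent that keeps asking for checks cannot stall delivery indefinitely.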

Common mistakes

Use a single LLM call and call it an agent.
Give the agent unbounded tools and no budget.
Fail to log evidence for later reviewer audits.

Case 4 (15 min)

Metrics question

A human reviewer and an automatic checker each output a list of found errors. How do you evaluate the checker?

Short answer

Evaluate both decision quality and error-list quality: accept/reject confusion matrix, precision/recall/F1 over matched error objects, false-accept rate, false-reject rate, manual-review rate and customer complaint rate.

Detailed breakdown

There are two levels. At the decision level, compare accept/reject/manual-review against human ground truth and track false accepts, false rejects and manual-review share. False accepts are often more costly because bad work reaches a customer; false rejects increase cost and delay.
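
The decision-level rates can be computed from pairs of human label and checker decision. The definitions below are one reasonable choice, stated as assumptions: false-accept rate is the share of truly bad outputs that were auto-accepted, false-reject rate is the share of truly good outputs that were auto-rejected, and manual-review rate is taken over all tasks.

```python
from collections import Counter

def decision_metrics(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """pairs: (human_label, checker_decision), human_label in {"good", "bad"}."""
    counts = Counter(pairs)
    n_bad = sum(c for (label, _), c in counts.items() if label == "bad")
    n_good = sum(c for (label, _), c in counts.items() if label == "good")
    n_review = sum(c for (_, decision), c in counts.items() if decision == "manual_review")
    total = len(pairs)
    return {
        "false_accept_rate": counts[("bad", "accept")] / n_bad if n_bad else 0.0,
        "false_reject_rate": counts[("good", "reject")] / n_good if n_good else 0.0,
        "manual_review_rate": n_review / total if total else 0.0,
    }
```

Reporting the three rates together prevents a checker from looking good on false accepts simply by pushing everything to manual review.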

At the error-list level, treat human and model outputs as sets of structured errors. Match errors by type plus location/evidence with a tolerance rule. Then compute precision, recall and F1 for error detection. If location is unavailable, evaluate type-level recall separately from exact-location recall.
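
A minimal sketch of the matching step, assuming each error is reduced to a (type, location) pair and that the tolerance rule is simply case and whitespace normalization of the location. A real tolerance rule (overlapping spans, nearby rows, fuzzy quotes) would replace `normalize`.

```python
from collections import Counter

def normalize(location: str) -> str:
    """Illustrative tolerance rule: ignore case and surrounding whitespace."""
    return location.strip().lower()

def match_prf(human: list[tuple[str, str]], model: list[tuple[str, str]],
              use_location: bool = True) -> dict[str, float]:
    """Precision/recall/F1 of model errors against human errors, matched one-to-one."""
    def key(error: tuple[str, str]) -> tuple:
        error_type, location = error
        return (error_type, normalize(location)) if use_location else (error_type,)

    unmatched = Counter(key(e) for e in human)
    true_positives = 0
    for e in model:
        k = key(e)
        if unmatched[k] > 0:  # each human error can be matched only once
            unmatched[k] -= 1
            true_positives += 1
    precision = true_positives / len(model) if model else 0.0
    recall = true_positives / len(human) if human else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Calling it once with `use_location=True` and once with `use_location=False` gives the exact-location and type-level views separately, as suggested above.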

Add product metrics: customer complaints after auto-accept, reviewer override rate, time saved, cost per task, latency and drift by task category. A checker that finds many errors but sends everything to manual review may be accurate but not useful.

Common mistakes

Use only overall accuracy and ignore false accept cost.
Compare free-text explanations with string equality.
Forget the manual-review budget constraint.

Case 5 (10 min)

ML System Design

You collected months of human-reviewer decisions for task outputs. How could you use this data to improve the automatic checker?

Short answer

Create supervised examples from task/spec/output/reviewer errors, clean and deduplicate labels, split by time/task/customer, then train classifiers, rerankers or fine-tune an LLM for structured error detection.

Detailed breakdown

Reviewer data can become supervised training data: input is the task spec, worker output and evidence; target is the reviewer decision plus structured error list. Before training, normalize taxonomies, remove low-quality/disputed labels, deduplicate near-identical tasks and protect customer-sensitive fields.
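
The cleaning steps above might be sketched as follows. The record fields (`spec`, `output`, `errors`, `disputed`), the alias table and the sensitive-field list are hypothetical, and hashing normalized text is only a crude stand-in for near-duplicate detection.

```python
import hashlib
import re

# Hypothetical mapping from legacy reviewer labels to the stable taxonomy.
TAXONOMY_ALIASES = {"broken link": "inaccessible_link", "bad format": "format_violation"}
SENSITIVE_FIELDS = {"customer_email"}  # hypothetical field to strip before training

def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-collapsed text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean(records: list[dict]) -> list[dict]:
    """Drop disputed labels and duplicates, normalize error types, strip sensitive fields."""
    seen: set[str] = set()
    cleaned = []
    for record in records:
        if record.get("disputed"):
            continue
        fp = fingerprint(record["spec"] + "\n" + record["output"])
        if fp in seen:
            continue  # keep only the first copy of a repeated task
        seen.add(fp)
        kept = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
        kept["errors"] = [TAXONOMY_ALIASES.get(e, e) for e in record["errors"]]
        cleaned.append(kept)
    return cleaned
```

Deduplicating before the train/test split matters: a template that appears on both sides inflates validation scores without improving the checker.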

Start with simpler models where possible: error-type classifiers, risk scoring models or retrieval of similar past failures. For LLMs, supervised fine-tuning can teach the desired output schema and error wording, while preference data can rank better explanations or reduce false accepts.

Validation must be time-aware and category-aware. Split by time, customer or task type to avoid memorizing templates. Track false accept rate, false reject rate, manual-review load and per-category degradation. Keep humans in the loop for uncertain or high-risk outputs.
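
A time-aware split with an optional held-out-customer filter can be sketched like this. The record fields `reviewed_on` and `customer` are assumed names for illustration.

```python
from datetime import date

def time_split(records: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Train on reviews before the cutoff, evaluate on reviews from the cutoff onward."""
    train = [r for r in records if r["reviewed_on"] < cutoff]
    test = [r for r in records if r["reviewed_on"] >= cutoff]
    return train, test

def unseen_customers_only(train: list[dict], test: list[dict]) -> list[dict]:
    """Stricter test set: keep only customers that never appear in training."""
    seen = {r["customer"] for r in train}
    return [r for r in test if r["customer"] not in seen]
```

Comparing metrics on the full test set with metrics on the unseen-customer subset gives a rough signal of how much the model relies on memorized templates.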