Question · Hard · rag-evaluation · RAG question from a technical interview · Parloa

How to compare two LLMs for customer support automation

There is a real product use case: customer support automation. You need to compare two LLM/agent variants and choose which one to launch. How would you design the evaluation: data, offline metrics, human/LLM judging, system metrics, and online validation?

Short answer

First, define product success: correct resolution, safe answers, escalation when needed, user satisfaction, latency, and cost. Then build a representative ticket eval set with gold resolutions and policy labels, run both LLMs under identical conditions, and evaluate retrieval, the final answer, routing/escalation, hallucination, and tone. Validate the human/LLM judges, and only then run a canary or A/B test on real traffic.
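The offline part of this answer can be sketched as a small scoring harness. This is a minimal illustration, not a reference implementation: the names `EvalCase`, `VariantOutput`, and `score_variant`, the two metrics chosen, and the toy tickets are all assumptions made for the example. It assumes each answer has already been mapped to a resolution label by a human or LLM judge.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    ticket_id: str
    gold_resolution: str  # expected resolution label for the ticket
    must_escalate: bool   # policy label: a human has to handle this ticket


@dataclass
class VariantOutput:
    resolution: str  # resolution label assigned to the variant's answer
    escalated: bool  # True if the variant handed the ticket to a human


def score_variant(cases, outputs):
    """Aggregate offline metrics for one variant over the shared eval set."""
    correct = need_escalation = missed_escalation = 0
    for case in cases:
        out = outputs[case.ticket_id]
        if case.must_escalate:
            need_escalation += 1
            missed_escalation += int(not out.escalated)
        elif not out.escalated and out.resolution == case.gold_resolution:
            correct += 1
    automatable = len(cases) - need_escalation
    return {
        "correct_resolution_rate": correct / automatable if automatable else 0.0,
        "unsafe_non_escalation_rate": (
            missed_escalation / need_escalation if need_escalation else 0.0
        ),
    }


# Hypothetical toy eval set; both variants are scored on the same cases.
cases = [
    EvalCase("t1", "refund", must_escalate=False),
    EvalCase("t2", "password_reset", must_escalate=False),
    EvalCase("t3", "legal_complaint", must_escalate=True),
]
outputs_a = {
    "t1": VariantOutput("refund", False),
    "t2": VariantOutput("refund", False),
    "t3": VariantOutput("", True),
}
outputs_b = {
    "t1": VariantOutput("refund", False),
    "t2": VariantOutput("password_reset", False),
    "t3": VariantOutput("refund", False),
}
scores_a = score_variant(cases, outputs_a)
scores_b = score_variant(cases, outputs_b)
```

In this toy data variant B resolves more of the automatable tickets but misses the one mandatory escalation, which is why resolution quality and unsafe non-escalation are reported as separate metrics rather than folded into one score.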

Full breakdown

Theory

Support automation evaluation combines RAG quality, generation faithfulness, policy/routing correctness and product outcome. Offline benchmarks reduce risk, but model choice must be validated on production traffic and sliced by risk/intent/language.
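Slicing by risk, intent, or language can be illustrated with a short grouping helper. This is a sketch under assumptions: the row schema (`model`, `risk`, `correct`) and the function name `accuracy_by_slice` are hypothetical, and the rows are made-up data.

```python
from collections import defaultdict


def accuracy_by_slice(rows, slice_key):
    """Return {(slice_value, model): accuracy} for rows with a boolean 'correct'."""
    totals = defaultdict(lambda: [0, 0])  # (slice, model) -> [correct, total]
    for row in rows:
        key = (row[slice_key], row["model"])
        totals[key][0] += int(row["correct"])
        totals[key][1] += 1
    return {key: c / t for key, (c, t) in totals.items()}


# Hypothetical per-ticket results for two variants.
rows = [
    {"model": "A", "risk": "low", "correct": True},
    {"model": "A", "risk": "low", "correct": True},
    {"model": "A", "risk": "high", "correct": False},
    {"model": "B", "risk": "low", "correct": True},
    {"model": "B", "risk": "low", "correct": False},
    {"model": "B", "risk": "high", "correct": True},
]
by_risk = accuracy_by_slice(rows, "risk")
```

Both variants answer two of three tickets correctly overall, yet `by_risk` shows A failing the high-risk ticket and B passing it. An aggregate number alone would hide that difference.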

Common mistakes

Comparing models on a generic benchmark instead of real tickets and policies.
Changing the model, prompt, retriever, and tools at the same time, which makes it impossible to tell where an improvement came from.
Looking only at containment rate while ignoring repeat contacts, hallucinations, and unsafe non-escalation.
Skipping slice analysis for high-risk or long-tail support cases.
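One way to guard against the "changed everything at once" mistake is to diff the two variant configurations before running the comparison. The sketch below is illustrative only: the config keys (`model`, `prompt`, `retriever`, `temperature`) and the helper names are assumptions, not part of any particular framework.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def changed_fields(config_a, config_b):
    """Return the sorted list of keys whose values differ between two configs."""
    keys = set(config_a) | set(config_b)
    return sorted(
        k for k in keys
        if config_a.get(k, _MISSING) != config_b.get(k, _MISSING)
    )


def assert_single_variable(config_a, config_b, allowed=("model",)):
    """Raise if the variants differ in anything other than the allowed fields."""
    diff = changed_fields(config_a, config_b)
    unexpected = [k for k in diff if k not in allowed]
    if unexpected:
        raise ValueError(f"variants also differ in: {unexpected}")
    return diff


# Hypothetical variant configs.
variant_a = {"model": "model-a", "prompt": "v3", "retriever": "bm25", "temperature": 0.0}
variant_b = {"model": "model-b", "prompt": "v3", "retriever": "bm25", "temperature": 0.0}
variant_c = {"model": "model-b", "prompt": "v4", "retriever": "bm25", "temperature": 0.0}

diff_ab = assert_single_variable(variant_a, variant_b)
```

Comparing `variant_a` with `variant_c` would raise, because the prompt version changed together with the model and the source of any improvement could no longer be attributed.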

How to answer in an interview

Start with product success and support-specific risks.
Describe a fair offline setup: same prompts, same retriever, same parameters.
Finish with an online A/B test and the option of routing between models rather than picking one globally.
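The idea of routing between models instead of picking one globally can be sketched as a per-slice routing table built from slice-level metrics. Everything here is an assumption for illustration: the metric names, the `max_unsafe_rate` threshold of 0.01, the fallback to a human queue, and the numbers themselves.

```python
def build_routing_table(slice_metrics, max_unsafe_rate=0.01, default="human"):
    """Pick, per slice, the best-resolving model among those that meet the safety bar.

    slice_metrics: {slice: {model: {"resolution_rate": float, "unsafe_rate": float}}}
    """
    table = {}
    for slice_name, per_model in slice_metrics.items():
        safe = {
            model: stats
            for model, stats in per_model.items()
            if stats["unsafe_rate"] <= max_unsafe_rate
        }
        if safe:
            table[slice_name] = max(safe, key=lambda m: safe[m]["resolution_rate"])
        else:
            table[slice_name] = default  # no variant is safe enough for this slice
    return table


# Hypothetical slice-level results from offline eval or an A/B test.
slice_metrics = {
    "billing": {
        "A": {"resolution_rate": 0.82, "unsafe_rate": 0.0},
        "B": {"resolution_rate": 0.88, "unsafe_rate": 0.005},
    },
    "account_security": {
        "A": {"resolution_rate": 0.70, "unsafe_rate": 0.0},
        "B": {"resolution_rate": 0.90, "unsafe_rate": 0.04},
    },
    "legal": {
        "A": {"resolution_rate": 0.40, "unsafe_rate": 0.05},
        "B": {"resolution_rate": 0.45, "unsafe_rate": 0.03},
    },
}
routing = build_routing_table(slice_metrics)
```

With these made-up numbers, billing goes to B, account security goes to A because B exceeds the safety threshold there, and legal falls back to a human queue. Applying the safety filter before maximizing resolution rate is a design choice that reflects the support-specific risks named at the start of the answer.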