Optimizing the cost of ASR and LLM inference for calls

Short answer

Profile the pipeline first. Then skip heavy stages for obvious rejections, batch where possible, quantize the ASR and LLM models, trim silence with VAD, cache reusable context, and route easy cases to smaller models.

Full breakdown

Optimize from measurements, not guesses. Break down cost by audio transfer, VAD, ASR, diarization, LLM extraction, validation and storage. The largest component determines the first intervention.
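The per-stage breakdown can be sketched as a small aggregation over cost records. This is a minimal illustration, not a real billing integration: the record layout, the stage names and the dollar amounts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-stage cost records: (call_id, stage, cost_usd).
# Stage names follow the breakdown in the text; the numbers are made up.
RECORDS = [
    ("c1", "vad", 0.0002), ("c1", "asr", 0.0120), ("c1", "llm_extraction", 0.0300),
    ("c2", "vad", 0.0001), ("c2", "asr", 0.0080), ("c2", "llm_extraction", 0.0250),
    ("c3", "vad", 0.0002), ("c3", "asr", 0.0150), ("c3", "diarization", 0.0040),
]

def cost_by_stage(records):
    """Sum cost per stage; return (stage, total, share) tuples, largest total first."""
    totals = defaultdict(float)
    for _call_id, stage, cost in records:
        totals[stage] += cost
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(stage, total, total / grand_total) for stage, total in ranked]

breakdown = cost_by_stage(RECORDS)
for stage, total, share in breakdown:
    print(f"{stage:15s} ${total:.4f}  {share:6.1%}")
```

The first row of the ranked output is the candidate for the first intervention. In a real system the records would come from metering or tracing rather than a hard-coded list.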

Common wins include VAD silence trimming, early rejection classifiers, batching ASR/LLM calls, quantization, faster runtimes such as ONNX Runtime where appropriate, smaller specialized models for the accept/reject decision, and selective use of the largest LLM only on hard calls. Also reduce prompt size: as the branch catalog grows, retrieve only the relevant branch candidates instead of passing the full catalog.
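The early-exit and model-selection ideas above amount to a routing decision per call. The sketch below assumes a cheap rejection classifier has already produced a score and that VAD has reported the remaining speech duration. The `Call` fields, the thresholds and the route names are hypothetical and would need tuning on labeled calls.

```python
from dataclasses import dataclass

@dataclass
class Call:
    call_id: str
    speech_seconds: float  # speech left after VAD trimming (assumed input)
    reject_score: float    # score in [0, 1] from a hypothetical cheap rejection classifier

# Hypothetical thresholds, not values taken from any real system.
MIN_SPEECH_SECONDS = 3.0
REJECT_THRESHOLD = 0.95
HARD_CASE_BAND = (0.30, 0.95)

def route(call: Call) -> str:
    """Pick the cheapest path expected to handle this call."""
    if call.speech_seconds < MIN_SPEECH_SECONDS:
        return "reject_early"   # almost no speech: skip ASR and LLM
    if call.reject_score >= REJECT_THRESHOLD:
        return "reject_early"   # obvious rejection: skip the heavy stages
    low, high = HARD_CASE_BAND
    if low <= call.reject_score < high:
        return "large_llm"      # ambiguous call: spend on the largest model
    return "small_model"        # confident easy case: the smaller model is enough

routes = [route(c) for c in (
    Call("a", 1.0, 0.10),
    Call("b", 40.0, 0.99),
    Call("c", 40.0, 0.50),
    Call("d", 40.0, 0.05),
)]
print(routes)
```

Keeping the thresholds as named constants makes it easier to re-tune them when quality metrics show that hard calls are being rejected too early.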

Track cost per call, latency percentiles, queue depth, and quality regressions. A cheaper pipeline that silently drops hard calls that should have been accepted is not acceptable; check every optimization against field-level booking metrics.
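One way to enforce this is a release gate that compares a candidate pipeline against the baseline on cost, latency and field-level quality together. The sketch below is an assumption-heavy illustration: the metric names, the tolerances and the sample numbers are invented, and `field_accuracy` stands in for whatever field-level booking metric is actually tracked.

```python
from statistics import mean, quantiles

def p95(values):
    """95th percentile via statistics.quantiles (needs at least two data points)."""
    return quantiles(values, n=100)[94]

def gate(baseline, candidate, max_quality_drop=0.01, max_p95_ratio=1.10):
    """Accept the candidate only if it is cheaper and stays within the tolerances."""
    checks = {
        "cheaper": mean(candidate["cost_usd"]) < mean(baseline["cost_usd"]),
        "quality_ok": candidate["field_accuracy"]
        >= baseline["field_accuracy"] - max_quality_drop,
        "latency_ok": p95(candidate["latency_s"])
        <= p95(baseline["latency_s"]) * max_p95_ratio,
    }
    return all(checks.values()), checks

# Made-up measurements for illustration only.
baseline = {
    "cost_usd": [0.05, 0.06, 0.04, 0.05],
    "latency_s": [2.0, 2.5, 3.0, 2.2],
    "field_accuracy": 0.94,
}
candidate = {
    "cost_usd": [0.03, 0.03, 0.02, 0.04],
    "latency_s": [1.8, 2.4, 2.9, 2.0],
    "field_accuracy": 0.90,
}

ok, checks = gate(baseline, candidate)
print(ok, checks)
```

Here the candidate is cheaper and no slower, but the gate still rejects it because the quality drop exceeds the tolerance, which is the situation the paragraph above warns about.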

Theory

Cost optimization in ML systems is a routing and profiling problem before it is a model-compression problem.

Common mistakes

  • Quantizing everything before profiling.
  • Optimizing latency while ignoring quality regressions.
  • Sending short rejection calls through the full pipeline.
  • Letting prompts keep growing with static context that could be retrieved instead.
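To illustrate the last point, static context can be replaced by a retrieval step that passes only the most relevant catalog entries to the LLM. The sketch below uses plain token overlap as a stand-in for whatever retriever is actually available (embeddings, BM25 and so on). The catalog entries, function names and prompt layout are hypothetical.

```python
import re

# Hypothetical branch catalog; the names are invented for illustration.
CATALOG = [
    "Central Clinic, Lenina Street 10",
    "North Branch, Parkovaya Street 5",
    "Riverside Branch, Naberezhnaya 22",
    "Airport Branch, Terminal Road 1",
]

def tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"\w+", text.lower()))

def top_candidates(transcript, catalog, k=2):
    """Rank entries by token overlap with the transcript; keep up to k with any overlap."""
    query = tokens(transcript)
    scored = [(len(query & tokens(entry)), entry) for entry in catalog]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep catalog order
    return [entry for score, entry in scored[:k] if score > 0]

def build_prompt(transcript, catalog, k=2):
    """Build a prompt that lists only the retrieved candidates, not the full catalog."""
    candidates = top_candidates(transcript, catalog, k)
    lines = ["Candidate branches:"] + [f"- {c}" for c in candidates]
    lines += ["Transcript:", transcript]
    return "\n".join(lines)

transcript = "I would like to book at the north branch on Parkovaya"
prompt = build_prompt(transcript, CATALOG)
print(prompt)
```

With this approach the prompt length depends on `k` rather than on catalog size. The trade-off is that a retrieval miss hides the correct branch from the LLM, so retrieval recall belongs among the tracked quality metrics.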

How to answer in an interview

  • Start with profiling and per-stage cost.
  • Name early exits and batching as practical first wins.