Production ML question

A deployed ML service has 300 ms latency, but the product now needs 30 ms. What do you investigate and what optimizations can you try?

Short answer

Profile the whole request path first, then optimize the bottleneck: model format, batching, quantization, hardware, caching, feature fetching, network calls and concurrency.

Full breakdown

Start with measurement. Break latency into feature fetching, serialization, network hops, queueing, preprocessing, model inference, postprocessing and downstream calls. Look at p50, p95, p99 and traffic bursts.
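One way to get that breakdown is to wrap each stage in a timer and report percentiles per stage. The sketch below is a minimal illustration using only the standard library; `StageTimer`, `handle_request` and the stage names are hypothetical, and the `time.sleep` calls stand in for real work. In production a tracing system would usually play this role.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles


class StageTimer:
    """Collects wall-clock durations (ms) per named stage across requests."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def summary(self):
        """Return {stage: {"p50": ..., "p95": ..., "p99": ...}} in milliseconds."""
        report = {}
        for name, values in self.samples.items():
            # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
            cuts = quantiles(values, n=100, method="inclusive")
            report[name] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
        return report


def handle_request(timer):
    # Hypothetical stages; the sleeps stand in for real work.
    with timer.stage("feature_fetch"):
        time.sleep(0.002)
    with timer.stage("inference"):
        time.sleep(0.005)
    with timer.stage("postprocess"):
        time.sleep(0.001)


timer = StageTimer()
for _ in range(20):
    handle_request(timer)

report = timer.summary()
for name, stats in report.items():
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

The stage with the largest p99 is the first candidate for optimization; the exact numbers printed depend on the machine.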

Model-level options include exporting to an optimized format or runtime (ONNX, TensorRT, TorchScript), quantization, FP16 inference, pruning, distillation, a smaller architecture and faster preprocessing. Serving-level options include right-sized hardware, concurrency tuning, warm workers and efficient serialization.
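To make the quantization idea concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. It only shows the arithmetic; the function names and the weight values are made up for illustration, and a real deployment would rely on the quantization tooling of its framework or runtime and re-check model accuracy afterwards.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]


# Illustrative weights, not taken from any real model.
weights = [0.82, -1.27, 0.05, 0.40, -0.63]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(round(max_error, 4))
```

Each weight now fits in one byte instead of four, and the round-trip error is bounded by half the scale. Whether that error is acceptable has to be measured on the model's own evaluation set.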

System-level options can matter more than the model: cache stable predictions or features, precompute offline, move feature lookup closer to the service and remove unnecessary synchronous calls. More pods help throughput and queueing, but not single-request compute latency if the bottleneck is the model itself.
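Caching stable features can be sketched as a small time-to-live cache in front of the slow lookup. This is an assumption-laden illustration: `TTLCache`, `fetch_user_features`, the key `"u42"` and the 60-second TTL are all hypothetical, and the class is neither thread-safe nor size-bounded.

```python
import time


class TTLCache:
    """Tiny in-process cache with per-entry expiry; not thread-safe, unbounded."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]  # fresh entry: skip the slow call
        value = compute(key)
        self._store[key] = (value, now + self.ttl)
        return value


calls = []


def fetch_user_features(user_id):
    # Hypothetical slow lookup standing in for a remote feature-store call.
    calls.append(user_id)
    return {"user_id": user_id, "orders_30d": 7}


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("u42", fetch_user_features)
second = cache.get_or_compute("u42", fetch_user_features)
print(len(calls))  # the second request is served from the cache
```

The TTL should reflect how quickly the underlying feature actually changes; caching a value that goes stale within seconds trades latency for correctness.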

Theory

Latency optimization is profile-driven; horizontal scaling does not solve every p99 bottleneck.
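A toy simulation illustrates why replicas help queueing but not per-request compute time. The workload below is assumed purely for illustration (one request every 100 ms, a fixed 300 ms of model compute, first-come-first-served dispatch); `simulate` is a hypothetical helper, not a model of any particular serving stack.

```python
import heapq


def simulate(arrival_times_ms, service_ms, replicas):
    """Return per-request latency (queue wait + service) with FIFO dispatch."""
    free_at = [0.0] * replicas  # min-heap of times at which each worker is idle
    heapq.heapify(free_at)
    latencies = []
    for arrival in arrival_times_ms:
        worker_free = heapq.heappop(free_at)  # earliest available worker
        start = max(arrival, worker_free)
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


# Assumed workload: one request every 100 ms, each needing 300 ms of compute.
arrivals = [i * 100.0 for i in range(50)]
for replicas in (1, 3, 10):
    lat = simulate(arrivals, service_ms=300.0, replicas=replicas)
    print(replicas, min(lat), max(lat))
```

Under these assumptions the worst-case latency falls from 10100 ms with one replica to 300 ms with three, and ten replicas do no better: once the queue is gone, every request still pays the full 300 ms of compute, so reaching 30 ms requires making the request itself cheaper.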

Common mistakes

  • Adding more pods before profiling.
  • Ignoring feature-store or network latency.
  • Optimizing average latency while p99 still violates the SLA.
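The last mistake is easy to show with numbers. The sample below is made up for illustration (95 fast requests and 5 slow ones, with a 30 ms target taken from the question): the mean looks fine while the tail does not.

```python
from statistics import mean, quantiles

# Made-up sample: 95 fast requests and 5 slow ones (e.g. cache misses), in ms.
latencies_ms = [12.0] * 95 + [250.0] * 5

avg = mean(latencies_ms)
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
p99 = quantiles(latencies_ms, n=100, method="inclusive")[98]
sla_ms = 30.0

print(f"mean={avg:.1f} ms, p99={p99:.1f} ms")
print("mean within SLA:", avg <= sla_ms)
print("p99 within SLA:", p99 <= sla_ms)
```

Here the mean is 23.9 ms, comfortably under the target, while the p99 is 250 ms, so a dashboard that tracks only the average would hide the violation.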

How to answer in an interview

  • Say "profile first" before listing optimizations.
  • Separate throughput scaling from single-request latency.