Production ML question

A neural network inference pipeline is too slow. What optimizations would you consider before changing the model architecture?

Answer it yourself

First formulate your answer as you would in an interview, then open the breakdown and assess yourself.


Check batching, preprocessing bottlenecks, device utilization, mixed precision/quantization, ONNX export, TensorRT/OpenVINO-style runtimes, caching and async pipelines before redesigning the model.

Full breakdown

Profile first. Measure preprocessing, the model forward pass, postprocessing and I/O separately. Many pipelines are slow not because of the model but because GPU utilization is low, batches are too small, CPU preprocessing blocks the GPU, or host-to-device data transfer dominates. Common optimizations include larger or dynamic batching, async queues, pinned memory, mixed precision, quantization, operator fusion and exporting to ONNX. Runtime engines such as TensorRT, OpenVINO or ONNX Runtime can compile and fuse kernels and, in many serving workloads, exploit the target hardware better than eager PyTorch. Also consider caching results for repeated inputs, pruning unused outputs, simplifying postprocessing and setting explicit latency and throughput targets. A larger batch size increases throughput but can hurt tail latency, so the right setting depends on the SLA and traffic shape.
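The profiling step and the batch-size trade-off can be illustrated with a minimal sketch. It uses only the Python standard library; `StageTimer`, `preprocess`, `forward`, `postprocess` and `run` are hypothetical names, and the `time.sleep` calls are stand-ins for real work with made-up costs (a fixed per-call overhead plus a per-item cost in `forward`). On a real GPU pipeline, wall-clock timing around the forward pass is only meaningful if the device is synchronized first (for example with `torch.cuda.synchronize()` in PyTorch), because kernel launches are asynchronous.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def report(self):
        """Return (stage, seconds, share of total) rows, slowest stage first."""
        total = sum(self.totals.values()) or 1.0
        rows = [(name, secs, secs / total) for name, secs in self.totals.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical stand-ins for real stages; the sleep durations are assumptions.
def preprocess(batch):
    time.sleep(0.001 * len(batch))            # per-item CPU work
    return batch


def forward(batch):
    time.sleep(0.003 + 0.0002 * len(batch))   # fixed call overhead + per-item cost
    return batch


def postprocess(batch):
    time.sleep(0.0002 * len(batch))
    return batch


def run(batch_size, n_items=64):
    """Push n_items through the pipeline in batches and time each stage."""
    timer = StageTimer()
    items = list(range(n_items))
    for i in range(0, n_items, batch_size):
        batch = items[i:i + batch_size]
        with timer.stage("preprocess"):
            batch = preprocess(batch)
        with timer.stage("forward"):
            batch = forward(batch)
        with timer.stage("postprocess"):
            batch = postprocess(batch)
    return timer


if __name__ == "__main__":
    for bs in (1, 8, 32):
        timer = run(bs)
        total = sum(timer.totals.values())
        print(f"batch_size={bs}: {64 / total:.0f} items/s")
        for name, secs, share in timer.report():
            print(f"  {name:<12} {secs * 1000:7.1f} ms  {share:5.1%}")
```

With these assumed costs, larger batches amortize the fixed overhead in `forward`, so the per-stage report shifts toward `preprocess` as the dominant cost. That is the kind of signal that suggests optimizing the data path before touching the model. The sketch does not measure per-request latency; a request that waits for a batch to fill pays that wait, which is where the tail-latency cost of batching comes from.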