Question · Medium · model-serving-optimization · Production ML question for a technical interview · Navio

Production ML question

Explain at a high level how TensorRT or similar inference optimizers speed up neural networks, and why INT8 quantization usually needs calibration.

Short answer

Optimizers compile a static graph for target hardware, fuse operators, choose kernels and reduce precision. INT8 needs calibration to map real activation ranges into 8-bit values without destroying accuracy.

Full walkthrough

ONNX is commonly used as an interchange format: it describes a static computation graph built from a standard set of operators. TensorRT can take such a graph and build an engine tuned for a specific NVIDIA GPU. The speedups come from operator fusion, kernel selection, memory-layout planning, constant folding, removal of unnecessary work, and lower precision where it is safe.

FP16 conversion is often relatively straightforward because it keeps floating-point semantics with fewer bits. INT8 is more delicate: activations and weights must be mapped into a small integer range. Weight ranges can be read directly from the trained model, but activation ranges depend on the inputs, so they have to be measured. Calibration runs representative data through the network to estimate activation ranges or distributions and choose scales (and, in asymmetric schemes, zero points). Without this, outliers and poorly chosen ranges can destroy model quality.

A good production answer also mentions dynamic shapes and unsupported operators as common reasons conversions fail or become slower than expected.
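To make the calibration idea concrete, here is a minimal sketch in plain Python. It is not TensorRT's calibrator: it assumes symmetric INT8 (scale only, no zero point), uses made-up activation data, and compares two simple ways of picking the scale, one from the absolute maximum and one from a high percentile. All function names are hypothetical.

```python
import random


def quantize_int8(x, scale):
    """Symmetric INT8: round x / scale and clamp to [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    """Map an INT8 value back to an approximate real value."""
    return q * scale


def scale_from_max(samples):
    """Calibrate from the largest magnitude seen (sensitive to outliers)."""
    return max(abs(v) for v in samples) / 127.0


def scale_from_percentile(samples, pct=99.9):
    """Calibrate from a high percentile of |x|, clipping rare outliers."""
    mags = sorted(abs(v) for v in samples)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100.0))
    return mags[idx] / 127.0


def mean_abs_error(samples, scale):
    """Average round-trip error of quantize followed by dequantize."""
    total = sum(abs(v - dequantize(quantize_int8(v, scale), scale)) for v in samples)
    return total / len(samples)


random.seed(0)
# Hypothetical calibration activations: mostly small values plus a few outliers.
calib = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [80.0, -95.0, 120.0]

s_max = scale_from_max(calib)
s_pct = scale_from_percentile(calib)
err_max = mean_abs_error(calib, s_max)
err_pct = mean_abs_error(calib, s_pct)

print(f"max-based scale:        {s_max:.4f}, mean abs error {err_max:.4f}")
print(f"percentile-based scale: {s_pct:.4f}, mean abs error {err_pct:.4f}")
```

With a max-based scale, three outliers stretch the range so far that typical values share only a handful of integer levels. The percentile-based scale gives typical values much finer resolution at the cost of clipping the outliers, which is the trade-off calibration is meant to settle using representative data.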

Theory

Inference compilation is graph-level and hardware-specific; quantization trades numerical precision for speed and memory.
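As an illustration of a graph-level optimization, the sketch below folds a batch-normalization step into the preceding linear step, so two operators become one with precomputed constants. It uses scalars for readability; real optimizers apply the same algebra per channel on tensors. The function names and numeric values are hypothetical, and this is only one common example of fusion, not a description of what any particular optimizer does to a given graph.

```python
import math


def linear(x, w, b):
    """A one-dimensional stand-in for a conv or dense layer."""
    return w * x + b


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm with fixed statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w2, b2) so that linear(x, w2, b2) equals batchnorm(linear(x, w, b))."""
    k = gamma / math.sqrt(var + eps)
    return w * k, (b - mean) * k + beta


# Hypothetical layer parameters and batch-norm statistics.
w, b = 0.8, -0.3
gamma, beta, mean, var = 1.5, 0.2, 0.1, 4.0

w_fused, b_fused = fold_batchnorm(w, b, gamma, beta, mean, var)

for x in [-2.0, 0.0, 3.5]:
    two_ops = batchnorm(linear(x, w, b), gamma, beta, mean, var)
    one_op = linear(x, w_fused, b_fused)
    # The fused layer reproduces the original pair up to float rounding.
    assert math.isclose(two_ops, one_op, rel_tol=1e-9, abs_tol=1e-12)
    print(f"x={x}: two ops {two_ops:.6f}, fused {one_op:.6f}")
```

The fused constants are computed once at build time, so at inference the batch-norm step costs nothing: one kernel launch and one pass over memory instead of two.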

Common mistakes

Treating ONNX and TensorRT as the same thing.
Saying quantization only saves disk size.
Ignoring the need for representative calibration data.

How to answer in an interview

Separate graph optimizations from numeric precision changes.
Call out validation after conversion as mandatory.
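Validation after conversion can be sketched as a comparison between the reference model's outputs and the converted engine's outputs on the same inputs. The example below is a minimal, framework-free version: the outputs are hard-coded hypothetical values, and the tolerance and agreement thresholds are assumptions that would need to be chosen per task.

```python
def max_abs_diff(a, b):
    """Largest element-wise difference between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def argmax(values):
    """Index of the largest element (the predicted class)."""
    return max(range(len(values)), key=lambda i: values[i])


def validate(reference_outputs, converted_outputs, atol=1e-2, min_agreement=0.99):
    """Compare per-sample outputs numerically and by top-1 prediction."""
    pairs = list(zip(reference_outputs, converted_outputs))
    worst = max(max_abs_diff(r, c) for r, c in pairs)
    agreement = sum(argmax(r) == argmax(c) for r, c in pairs) / len(pairs)
    return {
        "max_abs_diff": worst,
        "top1_agreement": agreement,
        "passed": worst <= atol and agreement >= min_agreement,
    }


# Hypothetical outputs from the original model and the converted engine.
reference = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
converted = [[0.102, 0.695, 0.203], [0.598, 0.301, 0.101]]

report = validate(reference, converted)
print(report)
```

In practice the comparison would run on a held-out set and would usually include the task metric (accuracy, mAP, and so on), since small numeric differences can still shift decisions near a threshold.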