Production ML question

Explain at a high level how TensorRT or similar inference optimizers speed up neural networks, and why INT8 quantization usually needs calibration.

Short answer

Inference optimizers compile a static graph for the target hardware: they fuse operators, select kernels and lower numeric precision where it is safe. INT8 needs calibration because real activation ranges must be mapped onto 8-bit integers without destroying accuracy.

Full explanation

ONNX is commonly used as an interchange format that describes a static computation graph and a set of operations. TensorRT can take such a graph and build an engine tuned for a specific NVIDIA GPU. The speedups come from operator fusion, kernel selection, memory-layout planning, constant folding, removing unnecessary work and using lower precision where safe.
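One way to see what operator fusion and constant folding buy is to fold a batch-norm layer into the linear layer before it. The sketch below is plain Python, not TensorRT code; the function names (`linear`, `batch_norm`, `fold_batch_norm`) and the small example values are assumptions made for illustration. It shows that two operators with constant parameters can be replaced by a single operator with precomputed weights.

```python
import math

def linear(x, weights, bias):
    """Dense layer: y[j] = sum_i x[i] * weights[j][i] + bias[j]."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm with fixed (constant) statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Precompute weights and bias so that one linear op equals linear + batch norm."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, s in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)   # constant per output channel
        folded_w.append([w * scale for w in row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b

# Hypothetical parameters, chosen only for the demonstration.
weights = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [1.0, 4.0]
x = [1.0, 2.0]

two_step = batch_norm(linear(x, weights, bias), gamma, beta, mean, var)
fused = linear(x, *fold_batch_norm(weights, bias, gamma, beta, mean, var))
print(two_step, fused)
```

The fused version does one pass over the data instead of two, which is the kind of saving a real optimizer obtains at much larger scale by removing intermediate tensors and kernel launches.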

Converting to FP16 is often relatively straightforward because it keeps floating-point semantics with fewer bits. INT8 is more delicate: activations and weights must be mapped into a small integer range. Weight ranges can be read directly from the trained model, but activation ranges depend on the inputs, so they have to be measured. Calibration runs representative data through the network to estimate activation ranges or distributions and then chooses the scales (and, in asymmetric schemes, the zero points). Without this, outliers and poorly chosen ranges can destroy model quality.
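The effect of outliers on the chosen range can be illustrated with a small sketch. It assumes symmetric per-tensor quantization with the zero point fixed at 0 and compares two simple calibration rules, max-based and percentile-based. The function names, the synthetic activations and the 99.9 percentile are illustrative assumptions, not what any particular optimizer does internally.

```python
import random

def quantize(x, scale):
    """Map a float to a signed 8-bit integer (symmetric, zero point = 0)."""
    return max(-127, min(127, round(x / scale)))

def dequantize(q, scale):
    return q * scale

def max_calibration(samples):
    """Scale chosen so the largest observed magnitude maps to 127."""
    return max(abs(v) for v in samples) / 127.0

def percentile_calibration(samples, pct=99.9):
    """Scale chosen from a high percentile; rarer, larger values get clipped."""
    mags = sorted(abs(v) for v in samples)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100.0))
    return mags[idx] / 127.0

def mean_abs_error(samples, scale):
    """Average round-trip error after quantize + dequantize."""
    return sum(abs(v - dequantize(quantize(v, scale), scale))
               for v in samples) / len(samples)

# Synthetic "activations": mostly small values plus one large outlier.
rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(10000)] + [80.0]

scale_max = max_calibration(activations)
scale_pct = percentile_calibration(activations)
print("max scale:", scale_max, "error:", mean_abs_error(activations, scale_max))
print("pct scale:", scale_pct, "error:", mean_abs_error(activations, scale_pct))
```

With the max-based rule, a single outlier stretches the range so that most values share only a handful of integer levels. The percentile rule clips the outlier and keeps finer resolution for typical values. Which trade-off is better depends on the model, which is why calibration needs data that resembles production traffic.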

A good production answer also mentions dynamic shapes and unsupported operators as common reasons conversions fail or become slower than expected.

Theory

Inference compilation is graph-level and hardware-specific; quantization trades numerical precision for speed and memory.

Common mistakes

  • Treating ONNX and TensorRT as the same thing.
  • Saying quantization only saves disk size.
  • Ignoring representative calibration data.

How to answer in an interview

  • Separate graph optimizations from numeric precision changes.
  • Call out validation after conversion as mandatory.
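Validation after conversion can be as simple as running the same inputs through the original and the converted model and comparing outputs. The sketch below is a minimal illustration: `reference_fn` and `optimized_fn` are hypothetical stand-ins for the two models, and the tolerance `atol` is an arbitrary assumption. In practice the check would usually also include the task metric (accuracy, mAP and so on) on a held-out set.

```python
def max_abs_diff(a, b):
    """Largest element-wise difference between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def validate_conversion(reference_fn, optimized_fn, inputs, atol=1e-2):
    """Return (passed, worst_diff) over all validation inputs."""
    worst = 0.0
    for x in inputs:
        worst = max(worst, max_abs_diff(reference_fn(x), optimized_fn(x)))
    return worst <= atol, worst

# Hypothetical stand-ins: the "optimized" model rounds its outputs,
# imitating the small numeric drift of a lower-precision engine.
def reference_fn(x):
    return [2.0 * v for v in x]

def optimized_fn(x):
    return [round(2.0 * v, 2) for v in x]

inputs = [[0.123, 0.456], [1.111, -2.222], [3.14159, 2.71828]]
passed, worst = validate_conversion(reference_fn, optimized_fn, inputs)
print(passed, worst)
```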