Production ML question
Explain at a high level how TensorRT or similar inference optimizers speed up neural networks, and why INT8 quantization usually needs calibration.
Answer it yourself
First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.
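Short answer
Inference optimizers compile a static graph for the target hardware: they fuse operators, select the fastest kernels, and reduce numeric precision where accuracy allows. INT8 needs calibration because activation ranges are only known from data; calibration measures them on representative inputs so they can be mapped to 8-bit values without destroying accuracy.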
Full walkthrough
ONNX is commonly used as an interchange format that describes a static computation graph and a set of operations. TensorRT can take such a graph and build an engine tuned for a specific NVIDIA GPU. The speedups come from operator fusion, kernel selection, memory-layout planning, constant folding, removing unnecessary work and using lower precision where safe.
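To make operator fusion and constant folding concrete, here is a minimal sketch in plain Python. It is an illustration of the general idea, not TensorRT's actual implementation: at inference time batch norm uses fixed statistics, so it is a per-channel affine map that can be folded into the preceding linear layer's weights and bias. All function names and the toy sizes are assumptions made for this example.

```python
import math
import random

def linear(x, W, b):
    """y[j] = sum_i W[j][i] * x[i] + b[j] for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm: fixed statistics, so a per-channel affine map."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm constants into the linear layer's weights and bias."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    W_folded = [[w * sc for w in row] for row, sc in zip(W, scale)]
    b_folded = [(bj - m) * sc + bt for bj, m, sc, bt in zip(b, mean, scale, beta)]
    return W_folded, b_folded

# Toy layer with random parameters (hypothetical sizes, chosen for illustration).
rng = random.Random(0)
n_in, n_out = 8, 3
x = [rng.gauss(0, 1) for _ in range(n_in)]
W = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
b = [rng.gauss(0, 1) for _ in range(n_out)]
gamma = [rng.gauss(0, 1) for _ in range(n_out)]
beta = [rng.gauss(0, 1) for _ in range(n_out)]
mean = [rng.gauss(0, 1) for _ in range(n_out)]
var = [rng.uniform(0.5, 2.0) for _ in range(n_out)]

reference = batchnorm(linear(x, W, b), gamma, beta, mean, var)  # two operators
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)         # done once, offline
fused = linear(x, W_f, b_f)                                     # one operator
max_abs_diff = max(abs(r - f) for r, f in zip(reference, fused))
print(f"max |reference - fused| = {max_abs_diff:.2e}")
```

The folded layer produces the same output up to floating-point rounding, but at run time it launches one operator instead of two and never writes the intermediate result to memory.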
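FP16 is usually the easier step: it keeps floating-point semantics with fewer bits, though its narrower range can still overflow or underflow in some layers. INT8 is more delicate: activations and weights must be mapped into a small integer range of 256 levels. Weight ranges can be read directly from the trained model, but activation ranges depend on the inputs, so calibration runs representative data through the network to estimate activation ranges or distributions and choose scales (and zero points, for asymmetric schemes). Without this, outliers and poorly chosen ranges can destroy model quality: a range that is too wide wastes most of the 256 levels on rare values, while one that is too narrow clips real signal.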
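The effect of outliers on the chosen range can be sketched with symmetric INT8 quantization in plain Python. This is a simplified illustration under stated assumptions: the data is synthetic, the percentile rule is a stand-in for the more elaborate criteria real calibrators use (for example entropy-based ones), and every function name here is hypothetical.

```python
import random

def quantize_dequantize(values, scale):
    """Symmetric INT8: round to an integer in [-127, 127], then map back to float."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out

def scale_from_max(calib):
    """Cover the full observed range, outliers included."""
    return max(abs(v) for v in calib) / 127

def scale_from_percentile(calib, pct=99.9):
    """Cover only the pct-th percentile of magnitudes; larger values get clipped."""
    mags = sorted(abs(v) for v in calib)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100))
    return mags[idx] / 127

def mean_abs_error(values, scale):
    deq = quantize_dequantize(values, scale)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

rng = random.Random(0)
# Synthetic calibration activations: a unit-Gaussian bulk plus three extreme outliers.
calib = [rng.gauss(0, 1) for _ in range(10_000)] + [80.0, -95.0, 120.0]
# Fresh "typical" activations used to measure the round-trip error.
typical = [rng.gauss(0, 1) for _ in range(10_000)]

scale_max = scale_from_max(calib)
scale_pct = scale_from_percentile(calib)
err_max = mean_abs_error(typical, scale_max)
err_pct = mean_abs_error(typical, scale_pct)
print(f"max-based scale:        {scale_max:.4f}, mean abs error {err_max:.4f}")
print(f"percentile-based scale: {scale_pct:.4f}, mean abs error {err_pct:.4f}")
```

With the max-based scale, three outliers stretch the quantization step to roughly the size of a typical activation, so most values collapse onto a handful of integer levels. The percentile-based scale clips the outliers instead, which is a trade-off rather than a free win: whether clipping is acceptable has to be checked on the model's task metric.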
A good production answer also mentions dynamic shapes and unsupported operators as common reasons conversions fail or become slower than expected.
Theory
Inference compilation is graph-level and hardware-specific; quantization trades numerical precision for speed and memory.
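The memory side of that trade can be shown with simple arithmetic. The sketch below assumes a hypothetical model with 25 million parameters and ignores the small overhead of storing quantization scales.

```python
# Bytes needed to store one element at each precision.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_bytes(num_params, dtype):
    """Raw storage for the weights alone, without scales or metadata."""
    return num_params * BYTES_PER_ELEMENT[dtype]

num_params = 25_000_000  # hypothetical model size, chosen for illustration
sizes_mib = {d: weight_bytes(num_params, d) / 2**20 for d in BYTES_PER_ELEMENT}
for dtype, mib in sizes_mib.items():
    print(f"{dtype}: {mib:.1f} MiB")
```

The same bytes have to be read from memory on every forward pass, so the reduction applies to memory bandwidth at run time and not only to the size of the file on disk.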
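Common mistakes
- Treating ONNX and TensorRT as the same thing: ONNX is an interchange format, TensorRT is an optimizer and runtime.
- Saying quantization only saves disk size; it also reduces memory use and speeds up inference.
- Ignoring the need for representative calibration data.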
How to answer in an interview
- Separate graph optimizations from numeric precision changes.
- Call out validation after conversion as mandatory.
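Validation after conversion can be as simple as running the same inputs through the original model and the optimized engine and comparing the outputs. The sketch below is one possible shape for such a check, not a standard tool: the function names, the tolerances, and the logits are all made up for illustration.

```python
def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def validate_conversion(reference, candidate, atol=1e-2, rtol=1e-2):
    """Compare two batches of logits element-wise and by top-1 prediction."""
    if len(reference) != len(candidate):
        raise ValueError("batches differ in size")
    max_abs_diff = 0.0
    out_of_tol = 0
    total = 0
    top1_agree = 0
    for ref_row, cand_row in zip(reference, candidate):
        for r, c in zip(ref_row, cand_row):
            diff = abs(r - c)
            max_abs_diff = max(max_abs_diff, diff)
            if diff > atol + rtol * abs(r):
                out_of_tol += 1
            total += 1
        top1_agree += argmax(ref_row) == argmax(cand_row)
    return {
        "max_abs_diff": max_abs_diff,
        "out_of_tolerance_fraction": out_of_tol / total,
        "top1_agreement": top1_agree / len(reference),
    }

# Made-up logits standing in for the original FP32 output and the optimized-engine output.
fp32_logits = [[2.10, 0.30, -1.20], [0.05, 0.07, -0.40], [-0.90, 1.50, 1.45]]
int8_logits = [[2.08, 0.33, -1.17], [0.08, 0.06, -0.41], [-0.88, 1.46, 1.49]]

report = validate_conversion(fp32_logits, int8_logits)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this made-up batch every element-wise difference is small, yet the top-1 prediction flips on two of the three rows because their leading logits were nearly tied. That is why a conversion check should include the task metric on held-out data and not only numeric closeness.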