Runtime Optimization Stack
ONNX Runtime, TensorRT, Triton Inference Server, TensorRT-LLM, torch.compile, and quantization as a layered optimization toolkit.
What the candidate should be able to do
- Know when ONNX export is worth it and where dynamic control flow and dynamic shapes make it hard.
- Explain TensorRT engine constraints, INT8 calibration, and hardware specificity (engines are built for a specific GPU and TensorRT version).
- Understand Triton Inference Server as a serving layer, not a magic model optimizer.
- Compare compile/export/quantization paths by engineering cost and rollout risk.
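Before any of these paths goes to production, the minimum bar is verifying that the optimized module still matches eager numerics. A minimal sketch, assuming a hypothetical `TinyMLP` model; `backend="eager"` only exercises TorchDynamo graph capture (no compiler toolchain needed), while a real rollout would use the default `"inductor"` backend:

```python
import torch

# Hypothetical toy model standing in for a real workload.
class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
        )

    def forward(self, x):
        return self.net(x)

model = TinyMLP().eval()
x = torch.randn(2, 16)

# backend="eager" tests only graph capture; swap in the default
# "inductor" backend to get actual kernel fusion and codegen.
compiled = torch.compile(model, backend="eager")

with torch.no_grad():
    diff = (model(x) - compiled(x)).abs().max().item()
print(f"max abs diff, eager vs compiled: {diff:.2e}")
assert diff < 1e-5
```

The same parity check applies verbatim to an ONNX Runtime session or a TensorRT engine: compare against eager outputs on representative inputs before trusting the optimized path.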
What interviewers ask
- When would TensorRT beat plain PyTorch?
- Why can TensorRT export fail?
- What does Triton solve and what does it not solve?
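One concrete answer to "why can export fail": trace-based export (torch.jit.trace, and ONNX export via tracing) records only the branch taken for the example input, so data-dependent Python control flow is silently baked in. A sketch with a hypothetical `Gate` module:

```python
import torch

# Hypothetical module with data-dependent control flow.
class Gate(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:      # Python branch on a tensor value
            return x * 2
        return x - 1

model = Gate().eval()
example = torch.ones(3)      # positive → trace records only the x * 2 path
traced = torch.jit.trace(model, example)  # emits a TracerWarning here

neg = torch.full((3,), -3.0)
print(model(neg))   # eager takes the x - 1 branch → tensor([-4., -4., -4.])
print(traced(neg))  # traced graph still multiplies  → tensor([-6., -6., -6.])
```

The trace runs without error and produces wrong results on the other branch, which is worse than a hard failure; this is exactly the class of model where ONNX/TensorRT export needs graph rewrites or scripting instead of tracing.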
Practical task
Take one PyTorch model and document optimization attempts: torch.compile, ONNX Runtime, TensorRT/Triton feasibility, and any blockers hit along the way.
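The documented attempts need comparable latency numbers. A minimal harness sketch, assuming a hypothetical `bench` helper and toy model; here it compares eager against TorchScript, and the same pattern extends to a `torch.compile` module or an ONNX Runtime session:

```python
import time
import torch

def bench(fn, x, warmup=5, iters=20):
    # Hypothetical helper: median wall-clock latency per call,
    # with warmup iterations excluded from measurement.
    with torch.no_grad():
        for _ in range(warmup):
            fn(x)
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn(x)
            times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 8)
).eval()
x = torch.randn(32, 64)

report = {
    "eager": bench(model, x),
    "torchscript": bench(torch.jit.script(model), x),
    # add "torch.compile", "onnxruntime", ... entries the same way
}
for path, latency in sorted(report.items(), key=lambda kv: kv[1]):
    print(f"{path:12s} {latency * 1e6:8.1f} us/call")
```

On GPU, the helper would also need `torch.cuda.synchronize()` around each timed call; wall-clock timing alone is misleading with asynchronous kernels.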
Source-grounded rule
Do not promise fixed speedups; optimization gains depend heavily on the model graph, kernel mix, precision, and hardware.