Runtime Optimization Stack
ONNX Runtime, TensorRT, Triton Inference Server, TensorRT-LLM, torch.compile, and quantization as a layered optimization toolkit.
What the candidate should be able to do
- Know when ONNX export is worth it and where dynamic control flow and dynamic shapes make it hard.
- Explain TensorRT engine constraints, INT8 calibration, and hardware specificity (engines are built for a specific GPU and TensorRT version).
- Understand Triton Inference Server as a serving layer, not a magic model optimizer.
- Compare compile/export/quantization paths by engineering cost and rollout risk.
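Before any of these paths goes to production, the minimum bar is verifying that the optimized module still matches eager numerics. A minimal sketch, assuming a hypothetical `TinyMLP` model; `backend="eager"` only exercises TorchDynamo graph capture (no compiler toolchain needed), while a real rollout would use the default `"inductor"` backend:

```python
import torch

# Hypothetical toy model standing in for a real workload.
class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
        )

    def forward(self, x):
        return self.net(x)

model = TinyMLP().eval()
x = torch.randn(2, 16)

# backend="eager" tests only graph capture; swap in the default
# "inductor" backend to get actual kernel fusion and codegen.
compiled = torch.compile(model, backend="eager")

with torch.no_grad():
    diff = (model(x) - compiled(x)).abs().max().item()
print(f"max abs diff, eager vs compiled: {diff:.2e}")
assert diff < 1e-5
```

The same parity check applies verbatim to an ONNX Runtime session or a TensorRT engine: compare against eager outputs on representative inputs before trusting the optimized path.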
What interviewers ask
- When would TensorRT beat plain PyTorch?
- Why can TensorRT export fail?
- What does Triton solve and what does it not solve?
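One concrete answer to "why can export fail": trace-based export (torch.jit.trace, and ONNX export via tracing) records only the branch taken for the example input, so data-dependent Python control flow is silently baked in. A sketch with a hypothetical `Gate` module:

```python
import torch

# Hypothetical module with data-dependent control flow.
class Gate(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:      # Python branch on a tensor value
            return x * 2
        return x - 1

model = Gate().eval()
example = torch.ones(3)      # positive → trace records only the x * 2 path
traced = torch.jit.trace(model, example)  # emits a TracerWarning here

neg = torch.full((3,), -3.0)
print(model(neg))   # eager takes the x - 1 branch → tensor([-4., -4., -4.])
print(traced(neg))  # traced graph still multiplies  → tensor([-6., -6., -6.])
```

The trace runs without error and produces wrong results on the other branch, which is worse than a hard failure; this is exactly the class of model where ONNX/TensorRT export needs graph rewrites or scripting instead of tracing.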
Practical task
Take one PyTorch model and document optimization attempts: torch.compile, ONNX Runtime, TensorRT/Triton feasibility, and any blockers hit along the way.
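The documented attempts need comparable latency numbers. A minimal harness sketch, assuming a hypothetical `bench` helper and toy model; here it compares eager against TorchScript, and the same pattern extends to a `torch.compile` module or an ONNX Runtime session:

```python
import time
import torch

def bench(fn, x, warmup=5, iters=20):
    # Hypothetical helper: median wall-clock latency per call,
    # with warmup iterations excluded from measurement.
    with torch.no_grad():
        for _ in range(warmup):
            fn(x)
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn(x)
            times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 8)
).eval()
x = torch.randn(32, 64)

report = {
    "eager": bench(model, x),
    "torchscript": bench(torch.jit.script(model), x),
    # add "torch.compile", "onnxruntime", ... entries the same way
}
for path, latency in sorted(report.items(), key=lambda kv: kv[1]):
    print(f"{path:12s} {latency * 1e6:8.1f} us/call")
```

On GPU, the helper would also need `torch.cuda.synchronize()` around each timed call; wall-clock timing alone is misleading with asynchronous kernels.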
Source-grounded rule
Do not promise fixed speedups; optimization gains depend heavily on the model graph, kernel mix, precision, and hardware.