Required

Inference Optimization Foundations

Latency, throughput, memory, cost, profiling, bottleneck attribution, batching trade-offs and hardware-aware thinking.

Study time: 28 min

Profiling-driven optimization: latency, throughput, memory, cost, p50/p95/p99, bottleneck attribution and safe benchmark design.

What a candidate should be able to do

Separate latency, throughput, memory and cost goals.
Use profiler traces to identify compute, memory, IO or scheduling bottlenecks.
Understand batching trade-offs for p95 latency and utilization.
Avoid benchmark claims without hardware, batch, precision and workload context.
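
The batching trade-off from the list above can be illustrated with a toy cost model. This is only a sketch: the linear model and the `fixed_ms` and `per_item_ms` values are assumptions chosen for illustration, not measurements, and the model ignores the time a request waits for its batch to fill.

```python
# Toy cost model (an assumption, not a measurement): one batch costs a fixed
# overhead plus a per-item cost, so larger batches amortize the overhead.
def batch_stats(batch_size, fixed_ms=8.0, per_item_ms=1.5):
    """Return (latency of one batch in ms, throughput in requests/s)."""
    batch_latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_rps = batch_size / (batch_latency_ms / 1000.0)
    return batch_latency_ms, throughput_rps


def sweep(batch_sizes=(1, 2, 4, 8, 16, 32)):
    """Evaluate the toy model for each batch size."""
    rows = []
    for b in batch_sizes:
        latency_ms, rps = batch_stats(b)
        rows.append({"batch": b, "latency_ms": latency_ms, "throughput_rps": rps})
    return rows


if __name__ == "__main__":
    for row in sweep():
        print(f"batch={row['batch']:>2}  "
              f"latency={row['latency_ms']:6.1f} ms  "
              f"throughput={row['throughput_rps']:7.1f} req/s")
```

Under this model both numbers rise with batch size: throughput improves because the fixed overhead is shared, while every request in the batch pays the longer batch latency. On real hardware the curve has to be measured rather than assumed.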

What interviewers ask

How would you reduce p95 latency by 3x?
What would you check if GPU utilization is low but the request queue is long?
How do you design a fair inference benchmark?
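
One way to approach the low-utilization question is to attribute request time to stages before changing anything. The sketch below assumes per-stage timings have already been extracted from a profiler trace; the stage names and numbers are hypothetical and exist only to show the bookkeeping.

```python
def rank_stages(stage_ms):
    """Return (stage, share of total time) pairs, largest share first."""
    total = sum(stage_ms.values())
    if total <= 0:
        raise ValueError("stage timings must sum to a positive value")
    shares = {name: ms / total for name, ms in stage_ms.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-request averages in milliseconds (not real measurements).
example_trace = {
    "queue_wait": 42.0,
    "tokenize_cpu": 18.0,
    "host_to_device_copy": 6.0,
    "gpu_forward": 9.0,
    "postprocess_cpu": 5.0,
}

if __name__ == "__main__":
    for name, share in rank_stages(example_trace):
        print(f"{name:<22}{share:6.1%}")
```

In this made-up trace the GPU forward pass is a small share of the total while queue wait and CPU tokenization dominate. That pattern would point at the stages feeding the GPU rather than at the model kernels, although a real diagnosis needs a real trace.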

Practical task

Create a benchmark harness for one model that varies batch size, concurrency and input size, and reports p50/p95 latency, throughput, memory and cost assumptions.
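
A minimal harness for this task might look like the sketch below. It uses only the standard library. `fake_model` is a hypothetical stand-in for a real forward pass, `hourly_cost_usd` is an assumed price rather than a quoted one, and the load pattern is a single burst of requests per configuration. Memory reporting is left out because it depends on the runtime; with PyTorch on CUDA, for example, `torch.cuda.max_memory_allocated()` is one place to read a peak value.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, p):
    """Nearest-rank percentile; p is in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def fake_model(batch_size, input_size):
    """Hypothetical stand-in for a model call: sleeps in proportion to the work."""
    time.sleep(0.0005 + 0.00001 * batch_size * input_size / 16)
    return batch_size


def timed_call(model, batch_size, input_size, submitted_at):
    """Run one request and return latency in ms, including time spent queued."""
    model(batch_size, input_size)
    return (time.perf_counter() - submitted_at) * 1000.0


def run_config(model, batch_size, concurrency, input_size,
               requests=40, warmup=5, hourly_cost_usd=1.0):
    # Warm-up calls are excluded from the measurements.
    for _ in range(warmup):
        model(batch_size, input_size)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(timed_call, model, batch_size, input_size,
                        time.perf_counter())
            for _ in range(requests)
        ]
        latencies_ms = [f.result() for f in futures]
    wall_s = time.perf_counter() - wall_start

    throughput = requests * batch_size / wall_s  # items per second
    return {
        "batch_size": batch_size,
        "concurrency": concurrency,
        "input_size": input_size,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "throughput_items_per_s": throughput,
        # Cost is derived from an assumed hourly price, not a measured bill.
        "usd_per_1k_items": hourly_cost_usd / 3600.0 / throughput * 1000.0,
    }


def sweep(model=fake_model, batch_sizes=(1, 8), concurrencies=(1, 4),
          input_sizes=(16, 128)):
    """Run every combination of batch size, concurrency and input size."""
    return [run_config(model, b, c, n)
            for b in batch_sizes for c in concurrencies for n in input_sizes]


if __name__ == "__main__":
    for row in sweep():
        print(row)
```

Latency is measured from submission to completion, so waiting in the executor queue counts against p50/p95. A harness that timed only the model call would hide the queueing that concurrency introduces.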

Source-grounded rule

Performance claims must include hardware, precision, batch shape, runtime and model version.
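
The rule can be enforced mechanically by refusing to print a number without its context. The sketch below is one possible shape, not a standard API: `BenchmarkContext` and `format_claim` are hypothetical names, and the example values are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkContext:
    """The five context fields named by the rule above."""
    hardware: str
    precision: str
    batch_shape: str
    runtime: str
    model_version: str


def format_claim(metric, value, unit, context):
    """Format a performance claim; raise if any context field is blank."""
    missing = [f.name for f in fields(context)
               if not str(getattr(context, f.name)).strip()]
    if missing:
        raise ValueError("claim is missing context: " + ", ".join(missing))
    details = ", ".join(f"{f.name}={getattr(context, f.name)}"
                        for f in fields(context))
    return f"{metric}: {value} {unit} ({details})"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    ctx = BenchmarkContext(
        hardware="gpu-x",
        precision="fp16",
        batch_shape="8x512",
        runtime="runtime-y 1.0",
        model_version="model-z v3",
    )
    print(format_claim("p95 latency", 41.0, "ms", ctx))
```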

Materials

Additional

PyTorch Profiler Recipe

torch.compile — PyTorch Docs

ONNX Runtime Graph Optimizations
