Inference Optimization Foundations
Profiling-driven optimization: latency, throughput, memory, cost, p50/p95/p99, bottleneck attribution and safe benchmark design.
What the candidate should be able to do
- Separate latency, throughput, memory and cost goals.
- Use profiler traces to identify compute, memory, IO or scheduling bottlenecks.
- Understand batching trade-offs for p95 latency and utilization.
- Avoid benchmark claims without hardware, batch, precision and workload context.
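The batching trade-off in the list above can be made concrete with a toy cost model. The sketch below assumes that one batch costs a fixed overhead plus a per-item term; the function names and the constants `fixed_ms` and `per_item_ms` are illustrative assumptions, not measurements from any real system. It shows only the direction of the trade-off: larger batches raise throughput and per-request latency together.

```python
def batch_stats(batch_size, fixed_ms=8.0, per_item_ms=0.5):
    """Toy model: latency of one batch and the resulting throughput.

    fixed_ms and per_item_ms are made-up constants for illustration;
    real values must come from profiling the actual model and hardware.
    """
    latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_items_per_s = batch_size * 1000.0 / latency_ms
    return latency_ms, throughput_items_per_s


def sweep(batch_sizes):
    """Return one (batch_size, latency_ms, items_per_s) row per batch size."""
    return [(b, *batch_stats(b)) for b in batch_sizes]


if __name__ == "__main__":
    for b, latency_ms, items_per_s in sweep([1, 4, 16, 64]):
        print(f"batch={b:3d}  latency={latency_ms:6.1f} ms  "
              f"throughput={items_per_s:8.1f} items/s")
```

In this model every request in a batch waits for the whole batch, so a latency target such as a p95 budget caps the usable batch size even though utilization keeps improving beyond it.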
What interviewers ask
- How would you reduce p95 latency by 3x?
- What do you check if GPU utilization is low but the request queue is long?
- How do you design a fair inference benchmark?
Practical task
Create a benchmark harness for one model that varies batch size, concurrency and input size, and reports p50/p95 latency, throughput, memory and cost assumptions.
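One possible shape for such a harness is sketched below, using only the Python standard library. Everything named here is an assumption made for illustration: `fake_model` is a stand-in that sleeps instead of running inference, `hourly_cost_usd` is a placeholder price, and the memory figure comes from `tracemalloc`, which sees host-side Python allocations only (accelerator memory would need the framework's own counters).

```python
import math
import time
import tracemalloc
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, q):
    """Nearest-rank percentile of samples, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def fake_model(batch):
    # Hypothetical stand-in: replace with the real inference call.
    time.sleep(0.0005 + 0.00001 * sum(len(x) for x in batch))
    return [len(x) for x in batch]


def timed_call(model, batch):
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    model(batch)
    return time.perf_counter() - start


def run_case(model, batch_size, concurrency, input_size,
             requests=40, warmup=5, hourly_cost_usd=1.0):
    """Measure one (batch, concurrency, input size) configuration."""
    batch = [[0] * input_size for _ in range(batch_size)]
    for _ in range(warmup):  # warm-up runs are excluded from the statistics
        model(batch)
    tracemalloc.start()      # host-side Python allocations only
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed_call(model, batch),
                                  range(requests)))
    wall_s = time.perf_counter() - wall_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    items = requests * batch_size
    return {
        "batch_size": batch_size,
        "concurrency": concurrency,
        "input_size": input_size,
        "items": items,
        "p50_ms": percentile(latencies, 50) * 1000,
        "p95_ms": percentile(latencies, 95) * 1000,
        "throughput_items_per_s": items / wall_s,
        "peak_host_bytes": peak_bytes,
        # Cost assumption: one instance billed at hourly_cost_usd (placeholder).
        "usd_per_1k_items": hourly_cost_usd * wall_s / 3600 / items * 1000,
    }


if __name__ == "__main__":
    for batch_size in (1, 8):
        for concurrency in (1, 4):
            print(run_case(fake_model, batch_size, concurrency, input_size=128))
```

The sweep runs each configuration in isolation and reports latency percentiles and throughput side by side, so a throughput gain cannot hide a tail-latency regression. With few requests per case the p95 is noisy; a real run would use far more samples and repeat each case.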
Source-grounded rule
Performance claims must include hardware, precision, batch shape, runtime and model version.
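One way to enforce this rule in tooling is to make the context a required part of every reported number. The sketch below is an illustration only: the `BenchmarkContext` class, the `format_claim` helper and all example values are hypothetical names chosen here, not part of any existing library.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkContext:
    """The five context fields that must accompany a performance claim."""
    hardware: str
    precision: str
    batch_shape: tuple
    runtime: str
    model_version: str

    def __post_init__(self):
        # Reject empty strings and empty shapes so no field is silently omitted.
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        if missing:
            raise ValueError(f"missing benchmark context: {', '.join(missing)}")


def format_claim(metric, value, unit, ctx):
    """Render a claim that always carries its full context."""
    return (f"{metric} = {value} {unit} "
            f"[hardware={ctx.hardware}, precision={ctx.precision}, "
            f"batch_shape={ctx.batch_shape}, runtime={ctx.runtime}, "
            f"model_version={ctx.model_version}]")


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    ctx = BenchmarkContext(
        hardware="example-gpu",
        precision="fp16",
        batch_shape=(8, 128),
        runtime="example-runtime 1.0",
        model_version="example-model-v1",
    )
    print(format_claim("p95 latency", 12.5, "ms", ctx))
```

Because the context object refuses to be constructed with a missing field, a number cannot be formatted for a report without its hardware, precision, batch shape, runtime and model version attached.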