Required

Inference Optimization Foundations

Latency, throughput, memory, cost, profiling, bottleneck attribution, batching trade-offs and hardware-aware thinking.

Study time: 28 min


Profiling-driven optimization: latency, throughput, memory, cost, p50/p95/p99, bottleneck attribution and safe benchmark design.
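A minimal sketch of how p50/p95/p99 can be computed from raw latency samples. It assumes the nearest-rank percentile definition; other tools (for example NumPy's default) interpolate between samples, so results can differ on small sample sets. The function names and sample values are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Median plus tail percentiles; a mean alone would hide the outliers."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# 100 requests: 98 fast ones and two slow outliers.
samples = [10.0] * 98 + [80.0, 250.0]
print(latency_summary(samples))  # -> {'p50': 10.0, 'p95': 10.0, 'p99': 80.0}
```

The mean of these samples is 13.1 ms, which suggests a uniformly slightly slow service; the percentiles instead show a fast median with a heavy tail.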

What the candidate should be able to do

  • Separate latency, throughput, memory and cost goals.
  • Use profiler traces to identify compute, memory, IO or scheduling bottlenecks.
  • Understand batching trade-offs for p95 latency and utilization.
  • Avoid benchmark claims without hardware, batch, precision and workload context.
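The batching trade-off can be sketched with a toy cost model. This assumes a hypothetical linear model (each batch costs a fixed overhead plus a per-item cost) and fixed-size batches that wait to fill; real servers use timeouts and dynamic batching, so treat the numbers as illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass
class BatchCostModel:
    """Hypothetical linear cost model: one batch takes overhead_ms + per_item_ms * batch_size."""
    overhead_ms: float
    per_item_ms: float

    def batch_time_ms(self, batch_size: int) -> float:
        return self.overhead_ms + self.per_item_ms * batch_size


def batching_tradeoff(model, batch_size, arrival_rate_per_s):
    """Return (capacity in items/s, latency in ms of the first request in a batch)."""
    exec_ms = model.batch_time_ms(batch_size)
    capacity_per_s = batch_size * 1000 / exec_ms
    if arrival_rate_per_s >= capacity_per_s:
        # Server cannot keep up: the queue, and therefore latency, grows without bound.
        return capacity_per_s, math.inf
    # First request waits for (batch_size - 1) more arrivals, then for execution.
    # Queueing behind earlier batches is ignored, so this is optimistic near saturation.
    fill_wait_ms = (batch_size - 1) * 1000 / arrival_rate_per_s
    return capacity_per_s, fill_wait_ms + exec_ms


model = BatchCostModel(overhead_ms=10.0, per_item_ms=1.0)
for rate in (50, 500):
    for bs in (1, 32):
        cap, lat = batching_tradeoff(model, bs, rate)
        print(f"rate={rate}/s batch={bs}: capacity={cap:.0f}/s latency={lat:.0f} ms")
```

Under this model, at 50 requests/s batch size 1 gives about 11 ms latency while batch size 32 waits about 620 ms just to fill. At 500 requests/s batch size 1 saturates and only the larger batch stays stable. The "right" batch size depends on load, which is why a p95 target and a utilization target have to be stated separately.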

What interviewers ask

  • How would you reduce p95 latency by 3x?
  • What if GPU utilization is low but the request queue is long?
  • How do you design a fair inference benchmark?
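For the low-utilization, long-queue question, one first step is to measure how much of a time window the GPU actually spent executing work. A minimal sketch, assuming kernel start/end timestamps have already been exported from a profiler trace as plain `(start, end)` pairs; that export format and the example numbers are hypothetical. A low busy fraction while requests queue points away from GPU compute and toward the host side: preprocessing, data transfer, synchronization, or a scheduler that dispatches batches too rarely.

```python
def busy_fraction(intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by at least one (start, end) interval.

    Overlapping kernels (e.g. on different streams) are merged so they are not double-counted.
    """
    if window_end <= window_start:
        raise ValueError("empty window")
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in intervals
        if e > window_start and s < window_end
    )
    busy = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Gap found: close the current merged interval and start a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)  # overlapping or touching: extend
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (window_end - window_start)


# Hypothetical kernel intervals (ms) inside a 10 ms window.
kernels = [(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)]
print(busy_fraction(kernels, 0.0, 10.0))  # -> 0.4
```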

Practical task

Create a benchmark harness for one model that varies batch size, concurrency and input size, and reports p50/p95 latency, throughput, memory use and cost assumptions.
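A minimal sketch of such a harness, under several stated assumptions: `fake_infer` is a hypothetical stand-in for a real model call; concurrency comes from a thread pool, which only helps if the model call releases the GIL (the sleep-based fake does); latency is per-call service time and excludes time spent waiting for a free worker; memory is the Python heap peak from `tracemalloc`, not GPU or device memory; and `usd_per_hour` is a placeholder price, not a real quote.

```python
import math
import time
import tracemalloc
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    return ordered[max(0, math.ceil(p * len(ordered) / 100) - 1)]


def fake_infer(batch):
    """Hypothetical stand-in for a real model call: cost grows with batch size and input length."""
    time.sleep(0.001 + 1e-6 * sum(len(x) for x in batch))
    return [len(x) for x in batch]


def run_config(infer, batch_size, concurrency, input_len, n_calls=50, warmup=3, usd_per_hour=None):
    batch = ["x" * input_len] * batch_size
    for _ in range(warmup):  # exclude lazy init / cache warm-up from the measurement
        infer(batch)

    def timed_call(_):
        t0 = time.perf_counter()
        infer(batch)
        return (time.perf_counter() - t0) * 1000  # service time in ms (excludes pool queueing)

    tracemalloc.start()  # traces Python heap only, NOT GPU/device memory
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies_ms = list(pool.map(timed_call, range(n_calls)))
    wall_s = time.perf_counter() - wall_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    items_per_s = batch_size * n_calls / wall_s
    report = {
        "batch_size": batch_size,
        "concurrency": concurrency,
        "input_len": input_len,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "items_per_s": items_per_s,
        "py_peak_mem_mb": peak_bytes / 1e6,
    }
    if usd_per_hour is not None:
        # Cost assumption: instance billed per hour and fully used by this workload.
        report["usd_per_1m_items"] = usd_per_hour / 3600 / items_per_s * 1e6
    return report


# Sweep the grid; usd_per_hour=1.0 is a placeholder, not a real price.
for bs in (1, 8):
    for conc in (1, 4):
        for n in (16, 256):
            print(run_config(fake_infer, bs, conc, n, usd_per_hour=1.0))
```

To use it with a real model, replace `fake_infer` with the actual inference call, measure device memory with the runtime's own tooling, and record hardware, precision, runtime and model version next to every row.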

Source-grounded rule

Performance claims must include hardware, precision, batch shape, runtime and model version.
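One way to make this rule hard to skip is to refuse to format a performance number without its context. A minimal sketch; `BenchmarkContext` and `format_claim` are hypothetical names, and the example values are placeholders rather than real hardware, runtimes or measurements.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkContext:
    """Context required with every performance number: hardware, precision, batch shape, runtime, model version."""
    hardware: str
    precision: str
    batch_shape: tuple
    runtime: str
    model_version: str

    def __post_init__(self):
        # Empty strings and empty tuples count as missing context.
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        if missing:
            raise ValueError("performance claim is missing: " + ", ".join(missing))


def format_claim(metric, value, unit, ctx):
    """Render a metric together with its context so the two cannot be separated."""
    return (
        f"{metric}={value} {unit} "
        f"[hardware={ctx.hardware}, precision={ctx.precision}, batch_shape={ctx.batch_shape}, "
        f"runtime={ctx.runtime}, model={ctx.model_version}]"
    )


# Placeholder values for illustration only.
ctx = BenchmarkContext("gpu-model-X", "fp16", (8, 512), "runtime-Y 1.2", "model-v3")
print(format_claim("p95_latency", 42.0, "ms", ctx))
```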