Production ML question
How can you increase LLM serving throughput or batch size on the same GPU without buying a larger GPU?
Short answer
Reduce memory per request and improve scheduling: lower precision and quantization, paged KV-cache management, continuous batching, shorter prompts, smaller max-token limits, and optimized runtimes such as vLLM or TensorRT-LLM.
Full explanation
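LLM batch size is usually limited by GPU memory: the weights are a fixed cost, and the KV cache grows with the number of concurrent sequences and the tokens each one holds. If the model fits but concurrent requests do not, free up memory and shrink the per-request footprint: use FP16/BF16 instead of FP32, apply weight quantization such as INT8/INT4 where quality allows, and limit prompt and output lengths.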
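The memory budget above can be sketched as back-of-the-envelope arithmetic. This is a rough estimate, not a measurement: the model shape (32 layers, 32 KV heads, head dimension 128, 7B parameters), the 24 GiB GPU and the 10% reserve for activations and runtime overhead are all illustrative assumptions, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2: one key vector and one value vector are stored per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(gpu_bytes, weight_bytes, tokens_per_seq,
                             per_token_bytes, reserve_fraction=0.1):
    # Memory left for the KV cache after weights and an assumed reserve.
    usable = gpu_bytes * (1 - reserve_fraction) - weight_bytes
    if usable <= 0:
        return 0
    return int(usable // (tokens_per_seq * per_token_bytes))


GIB = 1024 ** 3
GPU_BYTES = 24 * GIB            # assumed 24 GiB card
PARAMS = 7_000_000_000          # assumed 7B-parameter model

# Assumed model shape, FP16 KV cache (2 bytes per element).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

fp16_weights = PARAMS * 2       # 2 bytes per parameter
int4_weights = PARAMS // 2      # 0.5 bytes per parameter, ignoring quantization metadata

baseline = max_concurrent_sequences(GPU_BYTES, fp16_weights, 4096, per_token)
quantized = max_concurrent_sequences(GPU_BYTES, int4_weights, 4096, per_token)
shorter = max_concurrent_sequences(GPU_BYTES, fp16_weights, 2048, per_token)

print(per_token, baseline, quantized, shorter)
```

Under these assumptions each token costs 0.5 MiB of KV cache, so either quantizing the weights or halving the per-request token limit roughly doubles the number of sequences that fit at once.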
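Serving runtimes matter. Continuous batching admits new requests into the running batch as soon as earlier ones finish, instead of waiting for the whole batch to complete. Paged attention and KV-cache management allocate cache memory in small blocks, which reduces fragmentation and improves utilization. vLLM, TensorRT-LLM or similar runtimes can deliver much higher throughput than naive sequential generation.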
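A toy simulation can make the scheduling difference concrete. It is a simplified sketch, not how vLLM or TensorRT-LLM are implemented: it assumes every decode step costs the same regardless of how full the batch is, ignores prefill, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    # Each batch occupies the GPU until its longest request finishes.
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    # Free slots are refilled from the queue before every decode step.
    waiting = deque(output_lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step produces one token per active request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Mixed workload: a few long generations among many short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this toy workload the static scheduler leaves three slots idle while each long request finishes, whereas the continuous scheduler reuses them immediately, so the same requests complete in fewer decode steps.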
Also attack traffic shape: cache repeated prefixes, route simple tasks to smaller models, set per-route max-token budgets and batch only where the latency SLA allows. Streaming outputs helps perceived latency rather than throughput, which can make larger batches acceptable to users. More throughput is a product/SLA trade-off, not only a GPU trick.
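Two of these traffic-shaping ideas can be sketched in a few lines: clamping output length per route and estimating how much prefill work a shared prefix saves. The route names, budgets and helper functions below are hypothetical, and the prefix cache is a toy model that matches whole token prefixes, whereas real runtimes typically cache at block granularity.

```python
# Hypothetical per-route output budgets, in tokens.
ROUTE_MAX_NEW_TOKENS = {"classify": 16, "summarize": 256, "chat": 512}


def clamp_max_tokens(route, requested, default_budget=128):
    # Never let a request exceed the budget configured for its route.
    budget = ROUTE_MAX_NEW_TOKENS.get(route, default_budget)
    return max(1, min(requested, budget))


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefill_cost(prompts):
    # Returns (tokens prefilled without caching, tokens prefilled with
    # an idealized prefix cache that remembers every earlier prompt).
    seen = []
    naive = cached = 0
    for prompt in prompts:
        reuse = max((common_prefix_len(prompt, s) for s in seen), default=0)
        naive += len(prompt)
        cached += len(prompt) - reuse
        seen.append(prompt)
    return naive, cached


system_prompt = list(range(100))  # stand-in for a 100-token shared system prompt
prompts = [
    system_prompt + [1000, 1001],
    system_prompt + [2000],
    system_prompt + [1000, 1001, 3000],
]
naive_tokens, cached_tokens = prefill_cost(prompts)
print(clamp_max_tokens("classify", 1000), naive_tokens, cached_tokens)
```

In this example the shared system prompt is prefilled once instead of three times, which is where most of the saving comes from when many requests share the same instructions.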