Production ML question

How can you increase LLM serving throughput or batch size on the same GPU without buying a larger GPU?

Answer it yourself

First phrase your answer as you would in an interview, then open the breakdown and grade yourself.

Reduce memory per request and improve scheduling: weight quantization, lower precision, paged KV-cache management, continuous batching, shorter prompts, lower max-token limits, and optimized runtimes such as vLLM or TensorRT-LLM.

Full breakdown

LLM batch size is usually limited by GPU memory: the weights take a fixed share, and the KV cache grows with every concurrent request and every token in its context. If the model fits but concurrent requests do not, reduce memory per request: use FP16/BF16 instead of FP32, quantize weights to INT8/INT4 where quality allows, and limit prompt and output lengths.

Serving runtimes matter. Continuous batching admits new requests into the running batch as soon as earlier ones finish, instead of waiting for the whole batch to complete. Paged attention (block-based KV-cache management) reduces fragmentation and improves memory utilization. vLLM, TensorRT-LLM or similar runtimes can deliver higher throughput than naive sequential generation.

Also attack traffic shape: cache repeated prefixes, route simple tasks to smaller models, set per-route max token budgets and batch only where the latency SLA allows. Streaming outputs improves perceived latency rather than throughput, but it can make a larger batch acceptable to users. More throughput is a product/SLA trade-off, not only a GPU trick.
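The memory argument above can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128), the 24 GiB GPU and the 10% reserve for activations and runtime overhead are assumptions, not measurements. It also assumes the worst case, where every sequence holds KV cache for its full maximum length; paged allocators typically do better because they allocate blocks on demand.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical model dimensions used only for this estimate."""
    n_params: float   # total parameter count
    n_layers: int     # transformer layers
    n_kv_heads: int   # key/value heads (smaller than query heads with GQA/MQA)
    head_dim: int     # dimension of one attention head


def weight_bytes(shape: ModelShape, bytes_per_param: float) -> float:
    """Memory for the weights at a given precision (2.0 = FP16, 0.5 = INT4)."""
    return shape.n_params * bytes_per_param


def kv_bytes_per_token(shape: ModelShape, bytes_per_elem: int = 2) -> int:
    """One K and one V vector per layer and per KV head for each token."""
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem


def max_concurrent_sequences(
    shape: ModelShape,
    gpu_bytes: float,
    bytes_per_param: float,
    max_seq_len: int,
    kv_bytes_per_elem: int = 2,
    reserve_fraction: float = 0.1,  # assumed headroom for activations/overhead
) -> int:
    """Worst-case count of sequences whose full-length KV cache fits."""
    usable = gpu_bytes * (1 - reserve_fraction)
    free_for_kv = usable - weight_bytes(shape, bytes_per_param)
    if free_for_kv <= 0:
        return 0  # the weights alone do not fit
    per_sequence = kv_bytes_per_token(shape, kv_bytes_per_elem) * max_seq_len
    return int(free_for_kv // per_sequence)


# Assumed 7B-class shape and a 24 GiB GPU.
shape = ModelShape(n_params=7e9, n_layers=32, n_kv_heads=32, head_dim=128)
gpu = 24 * GIB

scenarios = [
    ("FP16 weights, 4096-token limit", 2.0, 4096),
    ("INT4 weights, 4096-token limit", 0.5, 4096),
    ("FP16 weights, 1024-token limit", 2.0, 1024),
]
for label, bytes_per_param, max_len in scenarios:
    n = max_concurrent_sequences(shape, gpu, bytes_per_param, max_len)
    print(f"{label}: up to {n} concurrent sequences")
```

Under these assumptions the KV cache costs 0.5 MiB per token, so a 4096-token sequence reserves 2 GiB. Quantizing the weights or cutting the length limit both free room for more concurrent sequences, which is the lever the answer describes. Real capacity depends on the runtime, the attention variant and the actual length distribution.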