Question · Medium · llm-inference · Production ML question from interview materials · Apriori

Production ML question

A 72B-parameter LLM is served on an A100 80GB. Estimate whether FP16 fits and explain what quantization changes.

Short answer

No. FP16 weights alone need about 144GB for 72B parameters (72B × 2 bytes), before KV cache and runtime overhead. Serving on a single 80GB A100 therefore requires quantization, sharding across GPUs, offload or a smaller model. INT4 weights are roughly 36GB plus overhead.

Full walkthrough

The back-of-the-envelope calculation is parameter count multiplied by bytes per parameter. In FP16 or BF16 each parameter takes 2 bytes, so 72 billion parameters need about 144GB for weights alone. This excludes the KV cache, activations during prefill, CUDA/runtime overhead, tokenizer buffers and serving framework memory. A single 80GB A100 therefore cannot host a 72B model in FP16.

Common ways to make it work are tensor/model parallelism across multiple GPUs, CPU/NVMe offload, using a smaller model, or quantization. With INT4 weight-only quantization, raw weight storage is roughly 0.5 bytes per parameter, so 72B weights take about 36GB before scales, metadata and framework overhead.

In practice, a quantized 72B model may fit on a single 80GB GPU for limited batch and context settings, but the KV cache can still dominate if context length or concurrency is high.
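The weight arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: it uses decimal GB (1 GB = 1e9 bytes), counts raw weights and nothing else, and the helper names `weight_gb` and `min_gpus_for_weights` are made up for this illustration.

```python
import math

# Bytes per parameter for raw weight storage (no scales or metadata).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(n_params: float, precision: str) -> float:
    """Raw weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(n_params: float, precision: str, gpu_gb: float = 80.0) -> int:
    """Lower bound on GPU count, counting weights only (no KV cache, no overhead)."""
    return math.ceil(weight_gb(n_params, precision) / gpu_gb)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_gb(72e9, precision)
        gpus = min_gpus_for_weights(72e9, precision)
        print(f"{precision}: {gb:.0f} GB of weights, at least {gpus} x 80GB GPU(s)")
```

The GPU count is a lower bound: INT8 weights at 72GB technically fit under 80GB, but they leave almost no room for the KV cache and runtime overhead.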

Theory

Serving feasibility depends on weights plus KV cache and overhead; quantization reduces weight memory, not all memory costs.
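To make the "weights plus KV cache and overhead" budget concrete, here is a minimal sketch of a KV cache estimate. The architecture numbers (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and the 4GB overhead figure are assumptions chosen for illustration, not values from the question; real models and serving frameworks will differ.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: keys and values (factor 2) in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical 72B-class architecture (assumed, not taken from the question).
ARCH = dict(n_layers=80, n_kv_heads=8, head_dim=128)

GPU_GB = 80.0       # A100 80GB
WEIGHTS_GB = 36.0   # INT4 raw weights, before scales and metadata
OVERHEAD_GB = 4.0   # assumed allowance for runtime and framework memory

per_token = kv_bytes_per_token(**ARCH)           # FP16 cache entries
kv_budget_gb = GPU_GB - WEIGHTS_GB - OVERHEAD_GB
max_cached_tokens = int(kv_budget_gb * 1e9 // per_token)

if __name__ == "__main__":
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"KV budget: {kv_budget_gb:.0f} GB, about {max_cached_tokens} tokens in total")
```

Under these assumptions the cache costs 320 KiB per token, so a 40GB budget holds roughly 122,000 tokens shared across all concurrent sequences. That is only about three requests at a 32k-token context, which is why the KV cache can dominate even after the weights are quantized.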

Common mistakes

Saying that a 72B model in FP16 fits into 80GB because 72 is less than 80.
Forgetting that FP16 uses two bytes per parameter.
Ignoring the KV cache when discussing long context or high concurrency.

How to answer in an interview

Do the mental math out loud: 72B * 2 bytes = 144GB.
If you revise the estimate, say which assumption changed, for example FP16 to INT4.