Production ML question

A 72B-parameter LLM is served on an A100 80GB. Estimate whether FP16 fits and explain what quantization changes.

Short answer

FP16 weights alone need about 144GB for 72B parameters, before KV cache and runtime overhead, so the model does not fit on a single 80GB A100. Serving it requires quantization, sharding across GPUs, offload, or a smaller model; INT4 weights are roughly 36GB plus overhead.

Full breakdown

The back-of-the-envelope calculation is parameter_count multiplied by bytes per parameter. In FP16 or BF16, each parameter is 2 bytes, so 72 billion parameters need about 144GB just for weights. This excludes KV cache, activations during prefill, CUDA/runtime overhead, tokenizer buffers and serving framework memory.
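
The arithmetic above can be written as a tiny helper. This is only a sketch: the function name `weight_memory_gb` is made up for illustration, and it uses decimal gigabytes (1GB = 10^9 bytes) to match the round numbers in the text.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights only, in decimal GB (1e9 bytes).

    Excludes KV cache, activations and runtime overhead.
    """
    return num_params * bytes_per_param / 1e9


# 72B parameters at 2 bytes each (FP16 or BF16)
print(weight_memory_gb(72e9, 2))    # 144.0
# Same model at 0.5 bytes per parameter (raw INT4)
print(weight_memory_gb(72e9, 0.5))  # 36.0
```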

An 80GB A100 therefore cannot host a full 72B model in FP16 on one GPU. Common ways to make it work are tensor/model parallelism across multiple GPUs, CPU/NVMe offload, using a smaller model, or quantization.
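
For the multi-GPU option, a rough lower bound on the GPU count follows from the same numbers. The sketch below assumes a hypothetical 20GB per GPU held back for KV cache and runtime overhead; that reserve is an illustrative figure, not a measured one, and the helper name is invented here.

```python
import math


def min_gpus_for_weights(weights_gb: float,
                         gpu_gb: float = 80.0,
                         reserved_gb: float = 20.0) -> int:
    """Smallest GPU count whose usable memory holds the sharded weights.

    reserved_gb is an assumed per-GPU budget for KV cache and overhead.
    """
    usable_gb = gpu_gb - reserved_gb
    if usable_gb <= 0:
        raise ValueError("reserved_gb must be smaller than gpu_gb")
    return math.ceil(weights_gb / usable_gb)


print(min_gpus_for_weights(144.0))  # 3 (FP16 weights)
print(min_gpus_for_weights(36.0))   # 1 (raw INT4 weights)
```

Serving frameworks often constrain the parallelism degree further (for example, to values that divide the number of attention heads), so the practical count may be higher than this bound.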

With INT4 weight-only quantization, the raw weight storage is roughly 0.5 bytes per parameter, so 72B weights are about 36GB before scales, metadata and framework overhead. In practice, a quantized 72B may fit on a single 80GB GPU for limited batch/context settings, but KV cache can still dominate if context length or concurrency is high.
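
To see how much the "scales and metadata" term adds, here is a minimal sketch under one assumed scheme: group-wise quantization with one FP16 scale per group of 128 weights and no zero-points. Real formats differ in group size and metadata, so treat the parameters as placeholders.

```python
def quantized_weight_gb(num_params: float,
                        weight_bits: int = 4,
                        group_size: int = 128,
                        scale_bits: int = 16) -> float:
    """Weight memory in decimal GB for group-wise quantization.

    Assumes one scale of scale_bits per group_size weights, no zero-points.
    """
    bits_per_param = weight_bits + scale_bits / group_size
    return num_params * bits_per_param / 8 / 1e9


# 4 bits + 16/128 bits of scale = 4.125 bits per parameter
print(quantized_weight_gb(72e9))  # 37.125
```

Under these assumptions the scales add about 1GB on top of the raw 36GB, which is small next to the KV cache at long context.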

Theory

Serving feasibility depends on weights plus KV cache and overhead; quantization reduces weight memory, not all memory costs.
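
The KV cache term can be estimated the same way. The sketch below assumes a standard attention layout where each layer stores one key and one value vector per KV head per token; the architecture numbers (80 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for illustration, not taken from the question.

```python
def kv_cache_gb(num_layers: int,
                num_kv_heads: int,
                head_dim: int,
                context_len: int,
                num_sequences: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache size in decimal GB for fully filled contexts.

    The factor 2 covers keys and values; bytes_per_elem=2 means FP16.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_len * num_sequences / 1e9


# Hypothetical 72B-class config: 80 layers, 8 KV heads, head_dim 128
one_seq = kv_cache_gb(80, 8, 128, context_len=32768, num_sequences=1)
eight_seq = kv_cache_gb(80, 8, 128, context_len=32768, num_sequences=8)
print(round(one_seq, 2))    # 10.74
print(round(eight_seq, 2))  # 85.9
```

With these assumed numbers, eight concurrent 32k-token sequences need more KV cache than the whole GPU holds, regardless of how the weights are quantized.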

Common mistakes

  • Saying 72B in FP16 fits into 80GB because 72 is less than 80.
  • Forgetting that FP16 uses two bytes per parameter.
  • Ignoring KV cache when discussing long context or high concurrency.

How to answer in an interview

  • Do the mental math out loud: 72B * 2 bytes = 144GB.
  • If you revise an estimate, say which assumption changed, for example FP16 to INT4.