Production ML question
A 72B-parameter LLM is served on an A100 80GB. Estimate whether FP16 fits and explain what quantization changes.
Answer it yourself
First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.
Short answer
FP16 weights alone need about 144GB for 72B parameters, before KV cache and runtime overhead. A single 80GB A100 cannot hold that, so you need quantization, sharding across several GPUs, offload or a smaller model; INT4 weights are roughly 36GB plus overhead.
Full walkthrough
The back-of-the-envelope calculation is parameter_count multiplied by bytes per parameter. In FP16 or BF16, each parameter is 2 bytes, so 72 billion parameters need about 144GB just for weights. This excludes KV cache, activations during prefill, CUDA/runtime overhead, tokenizer buffers and serving framework memory.
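The arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the helper name `weight_memory_gb` is made up for illustration, and GB is taken as 10^9 bytes to match the 144GB figure.

```python
def weight_memory_gb(param_count: float, bytes_per_param: float) -> float:
    """Raw weight storage in decimal GB (1 GB = 1e9 bytes), weights only."""
    return param_count * bytes_per_param / 1e9


PARAMS = 72e9          # 72B parameters
GPU_MEMORY_GB = 80     # one A100 80GB
FP16_BYTES = 2.0       # FP16 and BF16 both use 2 bytes per parameter

fp16_gb = weight_memory_gb(PARAMS, FP16_BYTES)
fits = fp16_gb < GPU_MEMORY_GB

# Weights only: KV cache, activations and runtime overhead come on top.
print(f"FP16 weights: {fp16_gb:.0f} GB, fits in {GPU_MEMORY_GB} GB: {fits}")
```

The comparison is deliberately optimistic, since it ignores everything except weights; if the weights alone do not fit, nothing else needs to be estimated.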
An 80GB A100 therefore cannot host a full 72B model in FP16 on one GPU. Common ways to make it work are tensor/model parallelism across multiple GPUs, CPU/NVMe offload, using a smaller model, or quantization.
With INT4 weight-only quantization, the raw weight storage is roughly 0.5 bytes per parameter, so 72B weights are about 36GB before scales, metadata and framework overhead. In practice, a quantized 72B may fit on a single 80GB GPU for limited batch/context settings, but KV cache can still dominate if context length or concurrency is high.
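To see how KV cache can outweigh the quantized weights, here is a minimal sketch. The architecture numbers (80 layers, 8 KV heads, head dimension 128) and the FP16 KV cache are assumptions chosen for illustration, not the configuration of any specific 72B model, and scales, metadata and framework overhead are left out.

```python
GPU_MEMORY_GB = 80
PARAMS = 72e9

# Hypothetical architecture, assumed for illustration only.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_gb(batch_size: int, context_len: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in decimal GB for batch_size sequences of context_len tokens."""
    tokens = batch_size * context_len
    # Factor 2: one key vector and one value vector per layer per KV head.
    bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * bytes_per_token / 1e9


# INT4 weight-only quantization: roughly 0.5 bytes per parameter.
int4_weights_gb = PARAMS * 0.5 / 1e9

for batch_size, context_len in [(1, 4096), (8, 32768)]:
    kv_gb = kv_cache_gb(batch_size, context_len)
    total_gb = int4_weights_gb + kv_gb
    verdict = "under" if total_gb < GPU_MEMORY_GB else "over"
    print(
        f"batch={batch_size} context={context_len}: "
        f"weights {int4_weights_gb:.0f} GB + KV cache {kv_gb:.1f} GB "
        f"= {total_gb:.1f} GB ({verdict} {GPU_MEMORY_GB} GB, before overhead)"
    )
```

Under these assumptions a single 4K-token sequence adds a little over 1GB of KV cache, while eight concurrent 32K-token sequences need more KV cache than the whole GPU holds, even though the weights themselves are only 36GB.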
Theory
Serving feasibility depends on weights plus KV cache and overhead; quantization reduces weight memory, not all memory costs.
Common mistakes
- Saying that 72B in FP16 fits into 80GB because 72 is less than 80.
- Forgetting that FP16 uses two bytes per parameter.
- Ignoring KV cache when discussing long context or high concurrency.
How to answer in an interview
- Do the mental math out loud: 72B * 2 bytes = 144GB.
- If you correct yourself, state which assumption changed, for example switching from FP16 to INT4.