Production ML question

You have a multi-GPU server and want to host one or more open-source LLMs. What software stack and design choices would you use?

Answer it yourself

First formulate your answer as you would in an interview, then read the breakdown and grade yourself.

Use a serving runtime such as vLLM or TensorRT-LLM, choose model size/quantization to fit weights plus KV cache, expose an API layer, monitor latency/GPU memory, and route tasks to suitable models.

Full breakdown

Start by sizing the model. GPU memory must cover the weights, the KV cache and runtime overhead. With multiple GPUs, decide whether to shard one large model across them or host several smaller models side by side. Quantization reduces memory use, but check its effect on output quality for your own tasks before relying on it.

Use an inference runtime built for LLM serving, such as vLLM, TensorRT-LLM or TGI, or llama.cpp for smaller CPU-only or mixed CPU/GPU deployments. vLLM-style continuous batching and paged attention improve throughput and KV-cache utilization compared with a naive Transformers generation loop.

Wrap the runtime with an API layer that handles authentication, request limits, prompt/token budgets, logging and monitoring. Track time to first token, tokens/sec, queue time, GPU memory, error rate and per-route cost. If product tasks differ, route extraction, chat and long-context jobs to different model sizes or configs.
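The sizing step can be made concrete with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: the model shape is a hypothetical 8B-parameter configuration, the 10% overhead fraction is a placeholder rather than a measured value, and all names (`ModelShape`, `max_cached_tokens`) are illustrative, not part of any serving runtime. It uses the standard per-token KV-cache size: two tensors (K and V) per layer, each of `n_kv_heads * head_dim` elements.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical model description; fill in from the model's config."""
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    n_kv_heads: int   # KV heads (fewer than query heads when GQA is used)
    head_dim: int     # dimension of each attention head


def weight_bytes(shape: ModelShape, bytes_per_param: float) -> float:
    """Memory for weights: 2.0 for FP16/BF16, about 0.5 for 4-bit quantization."""
    return shape.n_params * bytes_per_param


def kv_bytes_per_token(shape: ModelShape, bytes_per_elem: int = 2) -> int:
    """KV cache for one token: K and V tensors in every layer."""
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem


def max_cached_tokens(
    shape: ModelShape,
    gpu_bytes: float,
    bytes_per_param: float,
    overhead_frac: float = 0.10,  # assumed reserve for activations and runtime
) -> int:
    """Tokens of KV cache that fit after weights and overhead are subtracted."""
    usable = gpu_bytes * (1.0 - overhead_frac)
    free = usable - weight_bytes(shape, bytes_per_param)
    if free <= 0:
        return 0  # weights alone do not fit: shard or quantize
    return int(free // kv_bytes_per_token(shape))


# Illustrative shape only; read the real values from the model's config file.
example = ModelShape(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    print("KV bytes per token:", kv_bytes_per_token(example))
    for label, bpp in [("fp16", 2.0), ("int4", 0.5)]:
        tokens = max_cached_tokens(example, 24 * GIB, bpp)
        print(f"{label}: about {tokens} cached tokens on a 24 GiB GPU")
```

The token budget is shared by all concurrent requests, so dividing it by the expected context length gives a rough ceiling on concurrency. A result of zero signals that the model needs sharding across GPUs or a lower-precision format. Measured usage from the runtime should replace these estimates once the model is actually loaded.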