RAG question
A vector search returns the top-k nearest items, but the results are all too similar to each other. How can you keep relevance while increasing diversity?
First say your answer aloud or jot it down as bullet points.
Cover formulas, a solution plan, risks and examples.
Open the breakdown only after your own attempt.
Short answer
Retrieve a larger candidate pool, then rerank with a diversity-aware objective such as MMR, clustering, per-attribute quotas or multi-vector prefetch/rerank. Keep a relevance floor so diversity does not produce bad matches.
Detailed breakdown
Nearest-neighbor search optimizes similarity to the query, not diversity among results. If the top 20 are all near-duplicates, first retrieve more candidates than you need. Then rerank with an objective that balances query relevance and dissimilarity to already selected items.
Maximum Marginal Relevance is a standard approach: at each step choose an item with high similarity to the query and low similarity to selected items. Alternatives include clustering candidates and picking representatives, applying quotas over attributes such as color/brand/category, or using a two-stage retriever where one vector gets broad candidates and another reranks.
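The MMR selection loop described above can be sketched in a few lines. This is a minimal illustration, not a production reranker: the names `mmr`, `lam` and `min_relevance` are hypothetical, cosine similarity is assumed as the metric, and the toy vectors are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query, candidates, k, lam=0.7, min_relevance=0.0):
    """Pick up to k candidate indices by Maximum Marginal Relevance.

    lam=1.0 is pure relevance; lower values weight diversity more.
    Candidates below min_relevance (the relevance floor) are never picked.
    """
    relevance = [cosine(query, c) for c in candidates]
    remaining = [i for i, r in enumerate(relevance) if r >= min_relevance]
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already selected item.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (invented): items 0 and 1 are near-duplicates, item 2 is
# relevant but different, item 3 is unrelated to the query.
query = [1.0, 0.0, 0.0]
candidates = [
    [0.9, 0.40, 0.0],
    [0.9, 0.41, 0.0],
    [0.8, 0.00, 0.6],
    [0.0, 0.00, 1.0],
]
print(mmr(query, candidates, k=2, lam=1.0))
print(mmr(query, candidates, k=2, lam=0.5))
print(mmr(query, candidates, k=2, lam=0.2, min_relevance=0.5))
```

With a very low `lam` and no floor, the unrelated item can win purely because it is dissimilar to what was already chosen, which is why the relevance floor matters.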
In systems such as Qdrant, prefetch/multi-vector retrieval can support this pattern: retrieve from several embeddings or filters, merge candidates and rerank. The key production point is to tune diversity with offline relevance metrics and online user behavior; diversity is useful only if relevance remains acceptable.
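The merge-then-rerank pattern with per-attribute quotas can be sketched without any vector database. The code below is plain Python, not the Qdrant API: `merge_candidates`, `select_with_quota` and the hit dictionaries are hypothetical, and it assumes scores from the different sources are already on a comparable scale, which often needs normalization in practice.

```python
from collections import Counter

def merge_candidates(*result_lists):
    """Union several result lists by id, keeping the highest score per id."""
    best = {}
    for results in result_lists:
        for item in results:
            current = best.get(item["id"])
            if current is None or item["score"] > current["score"]:
                best[item["id"]] = item
    return sorted(best.values(), key=lambda it: it["score"], reverse=True)

def select_with_quota(ranked, k, attribute, max_per_value, min_score=0.0):
    """Walk candidates best-first, capping how many share one attribute value."""
    counts = Counter()
    picked = []
    for item in ranked:
        if len(picked) == k:
            break
        if item["score"] < min_score:
            break  # ranked is sorted, so the rest is below the floor too
        value = item[attribute]
        if counts[value] < max_per_value:
            picked.append(item)
            counts[value] += 1
    return picked

# Invented hits from two hypothetical retrievals (e.g. text and image vectors).
text_hits = [
    {"id": 1, "score": 0.95, "brand": "acme"},
    {"id": 2, "score": 0.93, "brand": "acme"},
    {"id": 3, "score": 0.90, "brand": "acme"},
]
image_hits = [
    {"id": 3, "score": 0.88, "brand": "acme"},
    {"id": 4, "score": 0.85, "brand": "bolt"},
    {"id": 5, "score": 0.40, "brand": "zeta"},
]
merged = merge_candidates(text_hits, image_hits)
top = select_with_quota(merged, k=4, attribute="brand", max_per_value=2, min_score=0.5)
print([item["id"] for item in top])
```

The quota forces a second brand into the list, while the score floor stops the weak match from being used just to fill the last slot.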
Production ML question
How can you increase LLM serving throughput or batch size on the same GPU without buying a larger GPU?
Short answer
Reduce memory per request and improve scheduling: quantization, lower precision, paged/KV-cache management, continuous batching, shorter prompts, smaller max tokens and optimized runtimes such as vLLM/TensorRT-LLM.
Detailed breakdown
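LLM batch size is often limited by GPU memory: model weights take a fixed share, and every concurrent request adds KV-cache memory that grows with its prompt and output length. If the model fits but concurrent requests do not, free memory for the cache and shrink the cache each request needs: use FP16/BF16 instead of FP32, apply weight quantization such as INT8/INT4 where quality allows, and limit prompt/output lengths.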
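A back-of-the-envelope calculation makes the weights-plus-KV-cache budget concrete. The sketch below assumes standard multi-head attention where each token stores one key and one value vector per layer and KV head; the model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128), the 24 GiB GPU and the 2 GiB runtime overhead are illustrative assumptions, not measurements.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes for one token: keys and values for every layer and head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_concurrent_sequences(gpu_bytes, weight_bytes, per_token_bytes,
                             tokens_per_seq, overhead_bytes=0):
    """How many full-length sequences fit in the memory left after weights."""
    free = gpu_bytes - weight_bytes - overhead_bytes
    if free <= 0:
        return 0
    return free // (per_token_bytes * tokens_per_seq)

GIB = 1024 ** 3
N_PARAMS = 7_000_000_000      # assumed model size
GPU = 24 * GIB                # assumed GPU memory
OVERHEAD = 2 * GIB            # assumed activations / runtime overhead

per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128,
                               bytes_per_elem=2)  # FP16 cache

for label, bytes_per_weight, max_tokens in [
    ("FP16 weights, 4096 tokens", 2, 4096),
    ("INT8 weights, 4096 tokens", 1, 4096),
    ("FP16 weights, 2048 tokens", 2, 2048),
]:
    n = max_concurrent_sequences(GPU, N_PARAMS * bytes_per_weight,
                                 per_token, max_tokens, OVERHEAD)
    print(f"{label}: {n} concurrent sequences")
```

Under these assumptions, both quantizing the weights and halving the token budget raise the number of sequences that fit, which is the lever the paragraph above describes.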
Serving runtimes matter. Continuous batching groups requests dynamically, while paged attention/KV-cache management reduces fragmentation and improves utilization. vLLM, TensorRT-LLM or similar runtimes can deliver higher throughput than naive sequential generation.
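A toy simulation shows why continuous batching helps when output lengths vary. It is a deliberately simplified model, not how vLLM or TensorRT-LLM are implemented: it counts decode steps only, ignores prefill and memory limits, and the function names and request lengths are invented for the example.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[start:start + batch_size])
    return total

def continuous_batching_steps(output_lengths, batch_size):
    """Refill free slots from the queue before every decode step."""
    waiting = deque(output_lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished ones free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Invented workload: two long generations mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

In the static case the short requests leave slots idle while a long one finishes; in the continuous case those slots are reused immediately, so the same work completes in fewer decode steps.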
Also attack traffic shape: cache repeated prefixes, route simple tasks to smaller models, stream outputs, set per-route max token budgets and batch only where latency SLA allows. More throughput is a product/SLA trade-off, not only a GPU trick.
Question
Explain the difference between BERT and GPT in terms of Transformer architecture and training objective.
Short answer
BERT is an encoder-only bidirectional Transformer trained mainly with masked language modeling for representation tasks. GPT is a decoder-only causal Transformer trained to predict the next token for generation.
Detailed breakdown
BERT uses Transformer encoder blocks with bidirectional self-attention, so each token can attend to tokens on both sides during pretraining. Its classical objective is masked language modeling: hide some tokens and predict them from context. This makes BERT strong for classification, retrieval embeddings, NER and other understanding tasks after fine-tuning.
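The masked language modeling setup can be illustrated with a toy data-preparation step. This is a simplified sketch: the function `mask_tokens` is hypothetical, it works on whole words rather than subword tokens, and it always substitutes `[MASK]`, whereas the original BERT recipe replaces only about 80% of the selected tokens with `[MASK]` and leaves the rest random or unchanged.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified MLM training example.

    labels holds the original token at masked positions and None elsewhere,
    so the loss is computed only where a token was hidden.
    """
    rng = random.Random(seed)
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    inputs = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "warm", "mat"]
inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

The model sees the whole sequence, including the words to the right of each `[MASK]`, which is what "bidirectional" means in practice.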
GPT-style models use decoder-only Transformer blocks with causal attention: each token can attend only to previous tokens. The objective is next-token prediction. This aligns directly with text/code generation and chat completion, so modern LLMs are mostly GPT-like decoder-only architectures.
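The difference in masking pattern and training target can be shown with plain lists. This is a minimal sketch with hypothetical helper names; real implementations build the mask as a tensor and apply it to attention scores before the softmax.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

def next_token_pairs(tokens):
    """(context, target) pairs used for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def show(mask):
    for row in mask:
        print(" ".join("1" if allowed else "0" for allowed in row))

print("Bidirectional (BERT-style):")
show(attention_mask(4, causal=False))
print("Causal (GPT-style):")
show(attention_mask(4, causal=True))

for context, target in next_token_pairs(["the", "cat", "sat"]):
    print(context, "->", target)
```

The causal mask is lower-triangular, so every position doubles as a training example whose target is the following token.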
Both use attention, residual connections, normalization and feed-forward layers, but the masking pattern and training objective drive their behavior. In an interview, state encoder/bidirectional/MLM for BERT and decoder-only/causal/next-token for GPT early.
Common mistakes
- Dwelling on BERT's limits for generation-like tasks while ignoring its main role as a representation model.
- Describe GPT as encoder-decoder by default.
- Forget the causal mask in GPT.