RAG-вопрос
Explain how LLM tool/function calling works end to end: tool schema in the prompt, model output, real tool execution and final user response.
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
The orchestration layer exposes tool schemas to the model, the model emits a structured tool call, the runtime parses and executes the real function, then appends the tool result back to context so the model can answer the user.
Подробный разбор
Tool calling is not the model magically accessing external systems. The application or provider runtime first tells the model what tools exist: names, descriptions, argument schemas and rules for how to emit calls. This may be represented as provider-native tool metadata or as prompt instructions plus a special output format.
When the user request requires external action, the model emits a structured tool call such as a function name plus JSON arguments. The orchestration layer validates the schema, parses the arguments and calls the real function or API. For a weather example, the LLM does not fetch weather itself; the runtime calls the weather API.
The tool result is then added back into the conversation as a tool message or equivalent context item. The model receives that result and produces the final natural-language answer. Production systems also need retries, validation, authorization, rate limits, tool-result truncation, logging and safeguards against prompt injection through tool outputs.
Типичные ошибки
- Describe tools as if the LLM directly calls APIs by itself.
- Skip argument validation and error handling.
- Forget the second model call that converts tool results into the final answer.
Как сказать на собеседовании
- Draw the loop: user message -> model tool call -> runtime execution -> tool result -> final model response.
- Mention both schema prompting and real execution; interviewers often probe that boundary.
Вопрос про production ML
A 72B-parameter LLM is served on an A100 80GB. Estimate whether FP16 fits and explain what quantization changes.
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
FP16 weights alone need about 144GB for 72B parameters, before KV cache and runtime overhead. An 80GB A100 requires quantization, sharding, offload or a smaller model; INT4 weights are roughly 36GB plus overhead.
Подробный разбор
The back-of-the-envelope calculation is parameter_count multiplied by bytes per parameter. In FP16 or BF16, each parameter is 2 bytes, so 72 billion parameters need about 144GB just for weights. This excludes KV cache, activations during prefill, CUDA/runtime overhead, tokenizer buffers and serving framework memory.
An 80GB A100 therefore cannot host a full 72B model in FP16 on one GPU. Common ways to make it work are tensor/model parallelism across multiple GPUs, CPU/NVMe offload, using a smaller model, or quantization.
With INT4 weight-only quantization, the raw weight storage is roughly 0.5 bytes per parameter, so 72B weights are about 36GB before scales, metadata and framework overhead. In practice, a quantized 72B may fit on a single 80GB GPU for limited batch/context settings, but KV cache can still dominate if context length or concurrency is high.
Типичные ошибки
- Say 72B FP16 fits into 80GB because 72 is less than 80.
- Forget that FP16 uses two bytes per parameter.
- Ignore KV cache when discussing long context or high concurrency.
Как сказать на собеседовании
- Do the mental math out loud: 72B * 2 bytes = 144GB.
- After correcting yourself, say what assumption changed, for example FP16 to INT4.
Как работает LoRA fine-tuning
Как работает LoRA fine-tuning
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
LoRA freezes the base model and learns small low-rank matrices whose product is added to selected linear layers. Only the adapter weights get gradients, so optimizer state and trainable memory are much smaller.
Подробный разбор
A dense linear layer has a weight matrix W. Full fine-tuning updates W directly, which is expensive for large LLMs because gradients and optimizer state must be stored for many parameters.
LoRA instead freezes W and learns a low-rank update Delta W = B A, where A and B have rank r much smaller than the original dimensions. During forward pass, the layer behaves like W x plus the adapter contribution B A x, often scaled by a LoRA alpha factor. Common target layers are attention projections and sometimes MLP projections.
Because only A and B are trainable, memory and compute for optimizer state are much lower. At deployment time, adapters can be kept separate and swapped per task/tenant, or merged into the base weights for simpler inference. The trade-off is that LoRA capacity depends on rank, target modules and data quality; it is not a replacement for all full fine-tuning cases.
Типичные ошибки
- Say LoRA trains a separate small model unrelated to the base model.
- Forget that the base weights are usually frozen.
- Assume LoRA always has no inference cost; separate adapters can add operational complexity.
Как сказать на собеседовании
- Use the formula Delta W = B A to make the answer concrete.
- Mention adapter swapping or merging if the role involves multi-tenant model serving.
Вопрос
Describe how you would train and validate a transformer-style reranking model for marketplace recommendations.
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
Build candidate lists, create labels from interactions, sample negatives carefully, train a sequence/cross-feature ranking model, validate offline with ranking metrics and ship only through guarded online experiments.
Подробный разбор
A reranker operates after candidate generation. Start by defining what it ranks: item-item recommendations, user-item candidates, search results or session-based candidates. Training data usually comes from impressions/clicks/orders/favorites/watch events, with time-based splits to avoid leakage.
Negative sampling matters. Random negatives are easy but often too easy; sampled displayed-but-not-clicked items, hard negatives from the retrieval stage and category-aware negatives can make the task closer to production. The model can be a sequence model such as SASRec/BERT4Rec, a two-tower plus cross features, or another transformer-style reranker depending on latency budget.
Offline metrics should match the ranked-list behavior: NDCG@K, Recall@K, MRR, MAP or task-specific conversion proxies. Offline wins are not enough because logs are biased by previous rankers and UI exposure. The final decision needs online A/B metrics such as CTR, conversion, retention, revenue or marketplace guardrails.
Типичные ошибки
- Train on future interactions by accident.
- Use random negatives only and get a misleadingly easy offline task.
- Optimize NDCG offline but ignore online business guardrails.
Как сказать на собеседовании
- Say what the candidate generator provides and what the reranker changes.
- Name both ranking metrics and the online product metric used for launch.
Файлы ML-модели, упаковка сервиса и безопасный rollout
Вы обучили и провалидировали ML-модель. Какие файлы и метаданные нужно версионировать, как упаковать сервис и как безопасно выкатить новую версию?
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
Version model weights, code, config, data/training metadata and evaluation results; package the inference service in a reproducible image or artifact setup; deploy through staging, canary/rolling rollout and monitoring.
Подробный разбор
A deployable ML model is more than a file with weights. You usually need the model artifact, preprocessing/postprocessing code, dependency versions, feature schema, config, training data snapshot or lineage, evaluation report and a registry entry tying those pieces together. MLflow or a similar registry can store model versions, metrics and promotion state.
Packaging depends on size and operational needs. For small models, embedding weights into the Docker image can make deployment reproducible and simple. For large models, the image may contain only code and download/load the model artifact from object storage or a model registry, because rebuilding or redeploying a huge image for every code change is wasteful.
Serving often uses a Python API such as FastAPI behind Kubernetes or an internal PaaS. A safe rollout goes through staging, smoke tests, canary or rolling update, monitoring of latency/errors/business metrics, rollback hooks and model-version logging in predictions.
Типичные ошибки
- Version weights but not preprocessing code or config.
- Put huge models into every application image without considering rebuild and startup cost.
- Skip rollback and prediction-time model-version logging.
Как сказать на собеседовании
- Answer in the order registry -> package -> deploy -> monitor -> rollback.
- Mention when weights inside the image are acceptable and when separate artifacts are better.
Вопрос про production ML
A production service already has data, but you need to change the database schema. Describe a safe миграцию.
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
Use versioned migration scripts, test them on representative data, make backward-compatible changes first, backfill safely, deploy code in phases, monitor and keep rollback/forward-fix plans.
Подробный разбор
Schema changes should be explicit and versioned, commonly through tools such as Alembic, Flyway, Liquibase, Django migrations or an internal migration framework. The migration is reviewed like code and tested against a staging copy or representative fixture data.
For low-risk changes, add nullable columns or new tables first, deploy code that can read/write both old and new shapes, backfill in batches, switch reads, then remove old fields in a later release. This expand-and-contract pattern avoids breaking old application versions during rolling deploys.
For large tables, avoid long locks and unbounded transactions. Use online DDL features where available, batch backfills, idempotent scripts, progress checkpoints and observability. Rollback is often a forward fix: you may not be able to safely undo a destructive migration after data has changed, so destructive steps should be delayed until the new path is stable.
Типичные ошибки
- Apply a breaking schema change before deploying compatible application code.
- Run one huge migration on a large table without lock and runtime analysis.
- Treat rollback as simply reversing the SQL after production data has changed.
Как сказать на собеседовании
- Use the phrase expand-and-contract if you know it, then explain it concretely.
- Name a migration tool but focus on the rollout sequence.
Вопрос про production ML
Explain the difference between a Kubernetes pod, service, deployment and node.
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
A node is a worker machine, a pod is the smallest schedulable unit containing one or more containers, a deployment manages desired replicas and rollouts, and a service provides stable networking/load balancing to pods.
Подробный разбор
A Kubernetes node is a worker machine, physical or virtual, that runs pods. It has the kubelet, container runtime and networking needed to host workloads.
A pod is the smallest schedulable Kubernetes unit. It usually contains one application container, but can contain tightly coupled sidecars that share network namespace and volumes. Pods are ephemeral; they can be killed and recreated with different IPs.
A deployment is a controller for stateless replicated workloads. It manages a ReplicaSet, desired replica count, rolling updates, rollbacks and pod template changes. You normally update a deployment rather than creating pods manually.
A service gives a stable network identity and load-balancing abstraction over a set of pods selected by labels. Because pod IPs change, clients usually call the service DNS/name rather than individual pods.
Типичные ошибки
- Call a pod a machine rather than a schedulable workload unit.
- Confuse deployment with the act of releasing code.
- Forget that services select pods by labels and hide changing pod IPs.
Как сказать на собеседовании
- Answer with one sentence per primitive, then connect them through a deployment creating pods on nodes behind a service.
- Use a model-serving example if the interviewer comes from ML platform work.
RAG-вопрос
Design the end-to-end сценарий for a RAG system: data preparation, vector index ingestion and serving-time retrieval.
Сначала проговорите ответ вслух или тезисами.
Формулы, план решения, риски и примеры.
Откройте разбор только после своей попытки.
Показать разбор
Короткий ответ
Prepare and chunk data, compute embeddings with metadata, ingest them into an ANN/vector store, retrieve top candidates at query time, rerank/filter them, assemble context and monitor answer quality.
Подробный разбор
Data preparation starts with collecting documents or multimodal items, cleaning them, splitting long text into chunks and attaching metadata such as source, timestamp, access controls and product identifiers. Chunking matters because embedding models have context limits and because retrieval should return useful evidence units rather than huge documents.
Ingestion computes embeddings for chunks or items and writes vectors plus metadata into a vector store such as Qdrant, OpenSearch vector search, FAISS-backed service or another ANN index. HNSW-style indexes are common for approximate nearest-neighbor search. The pipeline needs refresh logic, deletes/updates, versioning and monitoring for failed embeddings.
At serving time, the user query is normalized and embedded, optionally expanded or classified, and used to retrieve candidates with metadata filters and access controls. A reranker can improve relevance. The final context is assembled with citations or source markers and sent to the LLM. Production systems track latency, recall/relevance judgments, hallucination reports and index freshness.
Типичные ошибки
- Skip chunking and metadata, especially permissions and source tracking.
- Assume approximate vector top-k is enough without reranking or filtering.
- Forget update/delete paths and index freshness.
Как сказать на собеседовании
- Structure the answer in the exact three buckets from the prompt: data, ingestion, serving.
- Mention one concrete vector index and one quality metric or evaluation method.