ML System Design
Design a similar-items recommender for 1M items when the current collaborative model fails on cold-start items and misses semantic similarity.
First, say your answer out loud or outline it in bullet points.
Cover formulas, a solution plan, risks and examples.
Open the walkthrough only after you have made your own attempt.
Short answer
Build item embeddings from content modalities, retrieve nearest neighbors with ANN, then refine with feedback-trained ranking or metric learning. Keep collaborative signals as features where available, but do not depend on them for new items.
Detailed walkthrough
Start by clarifying the task. This is item-to-item retrieval: for an anchor item, return similar and useful items. The existing collaborative model has cold-start and semantic-similarity gaps, so the first layer should use content that exists for new items: text, images/video frames, metadata, genres, actors and tags, plus behavioral aggregates when available.
Generate item embeddings with modality-specific encoders and store them in an ANN index. For cold-start items, content embeddings are enough to produce neighbors. For warm items, add collaborative/feedback signals in a reranker or fine-tuning stage.
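As a minimal sketch of this retrieval layer, the snippet below uses exact cosine search in plain Python as a stand-in for an ANN index; a production system would swap in a real ANN library. The class name `SimilarItemIndex`, the item IDs and the 3-dimensional vectors are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class SimilarItemIndex:
    """Exact cosine search over all items (hypothetical stand-in for ANN)."""

    def __init__(self):
        self.vectors = {}  # item_id -> unit-length embedding

    def add(self, item_id, embedding):
        # A new item becomes searchable as soon as its content embedding is added.
        self.vectors[item_id] = l2_normalize(embedding)

    def query(self, anchor_id, k=3):
        anchor = self.vectors[anchor_id]
        scored = [
            (sum(a * b for a, b in zip(anchor, vec)), item_id)
            for item_id, vec in self.vectors.items()
            if item_id != anchor_id
        ]
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:k]]


index = SimilarItemIndex()
index.add("thriller_a", [0.9, 0.1, 0.0])
index.add("thriller_b", [0.8, 0.2, 0.1])
index.add("cartoon_a", [0.0, 0.1, 0.9])
# Cold-start item: no interactions yet, only a content embedding.
index.add("new_thriller", [0.85, 0.15, 0.05])
neighbors = index.query("new_thriller", k=2)
```

In this toy data the new item's two nearest neighbors are the two thrillers, even though it has no interaction history.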
Visual or textual similarity alone should not be the target. Similar items must also be useful to users, so train or rerank using positives such as clicks, watches, purchases, likes or long watch time, and negatives such as skips, dislikes or exposed-but-ignored items. The system should generate candidates, rerank them, apply business-constraint filters and log impressions for evaluation.
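The rerank-and-filter step could look roughly like the sketch below. It blends content similarity with a smoothed click-through rate so that items with no feedback fall back to a prior instead of zero. The function name, the blending weight `alpha` and the prior values are hypothetical choices for illustration, not tuned recommendations.

```python
def rerank(candidates, feedback, blocked, alpha=0.7, prior_ctr=0.05, prior_weight=20):
    """Rerank retrieved candidates for one anchor item.

    candidates: list of (item_id, content_similarity) from the retrieval stage.
    feedback:   item_id -> (clicks, impressions) logged for this anchor.
    blocked:    item_ids removed by business rules (assumed to be given).
    """
    ranked = []
    for item_id, similarity in candidates:
        if item_id in blocked:  # business-constraint filter
            continue
        clicks, impressions = feedback.get(item_id, (0, 0))
        # Smoothed CTR: with zero impressions this equals prior_ctr.
        ctr = (clicks + prior_ctr * prior_weight) / (impressions + prior_weight)
        ranked.append((alpha * similarity + (1 - alpha) * ctr, item_id))
    ranked.sort(reverse=True)
    return [item_id for _, item_id in ranked]


candidates = [("a", 0.90), ("b", 0.88), ("c", 0.95), ("new", 0.89)]
feedback = {"a": (2, 1000), "b": (300, 1000)}
order = rerank(candidates, feedback, blocked={"c"})
```

Here `b` moves ahead of `a` because users actually click it, `new` stays competitive on content similarity plus the prior, and `c` is filtered out.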
Common mistakes
- Using only collaborative filtering despite the cold-start requirement.
- Returning nearest visual neighbors without user usefulness signals.
- Forgetting to refresh the ANN index for new items.
How to say it in an interview
- Separate retrieval, training data, reranking and evaluation.
- Mention what happens for brand-new items on day one.
ML System Design
How would you build positives and negatives for training a similar-items model, and what loss would you use?
Short answer
Mine candidate pairs from pretrained/content neighbors and interaction logs, label positives/negatives with human/VLM/behavioral signals, then train with contrastive, triplet or sampled softmax objectives using random and hard negatives.
Detailed walkthrough
For positives, use multiple sources: co-watch/co-click/co-purchase patterns, editorial/manual similar-item labels, items in the same franchise/genre/entity clusters, and VLM or human labels for semantic similarity. For new items without interactions, content-neighbor mining can generate candidates for labeling.
Negatives should not be only random items from the full catalog. Random negatives are too easy in a million-item catalog. Mix random negatives with hard negatives: close ANN neighbors judged not similar, same-category but wrong intent, high-popularity distractors and exposed-but-skipped items.
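One way to mix the two kinds of negatives is sketched below. It assumes the hard-negative pool (for example, close ANN neighbors judged not similar, or exposed-but-skipped items) has already been mined; the function name and the 2+2 split are illustrative.

```python
import random


def sample_negatives(anchor, positives, catalog, hard_pool,
                     n_random=2, n_hard=2, seed=0):
    """Return hard and random negatives for one anchor, never a known positive."""
    rng = random.Random(seed)
    excluded = set(positives) | {anchor}

    # Hard negatives come from a pre-mined pool (assumed to exist upstream).
    hard_candidates = [i for i in hard_pool if i not in excluded]
    hard = rng.sample(hard_candidates, min(n_hard, len(hard_candidates)))
    excluded |= set(hard)

    # Random negatives come from the rest of the catalog.
    random_candidates = [i for i in catalog if i not in excluded]
    rand = rng.sample(random_candidates, min(n_random, len(random_candidates)))
    return {"hard": hard, "random": rand}


catalog = [f"item_{i}" for i in range(20)]
negatives = sample_negatives(
    anchor="item_0",
    positives=["item_1", "item_2"],
    catalog=catalog,
    hard_pool=["item_1", "item_3", "item_4", "item_5"],  # item_1 is a positive
)
```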
Loss choices include contrastive loss, triplet loss, InfoNCE with in-batch negatives, or sampled softmax. In-batch negatives make training efficient, while mined hard negatives make the task meaningful. Split train and validation data carefully, for example by time and by near-duplicate group, so that the model cannot memorize near-duplicate labels or peek at future feedback.
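To make the in-batch-negatives idea concrete, here is a small pure-Python sketch of the InfoNCE loss. It assumes the embeddings are already unit-length and that row i of `positives` is the positive for anchor i; the temperature of 0.1 is an arbitrary illustrative value. Real training would use a tensor library with automatic differentiation.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where every other row in the batch acts as a negative."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching row as the correct class.
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each anchor is nearest to its own positive and grows large when the pairs are swapped.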
Common mistakes
- Using only random negatives from a huge catalog.
- Treating all unobserved pairs as negatives without exposure context.
- Forgetting that manual/VLM labels need quality checks.
How to say it in an interview
- Say that random negatives give coverage and hard negatives give discrimination.
- Mention in-batch negatives as an efficient default.
ML System Design
How would you build item embeddings from text, images/video and categorical/numerical attributes under real serving constraints?
Short answer
Use lightweight modality encoders, normalize/project each modality, fuse into one item vector and precompute offline when possible. CLIP-like models can align image/text; attributes need embeddings or compact projections.
Detailed walkthrough
A practical design uses separate encoders per modality. Text can use a small BERT/Sentence-BERT-like encoder. Images or key video frames can use EfficientNet, ViT or CLIP visual towers. Video may be represented by sampled frames, scene-level embeddings, audio/subtitle signals or a heavier video model if latency and cost allow.
Categorical attributes should not become huge sparse one-hot vectors at serving time. Map them to learned embeddings or compact projections. Numerical features can be normalized, bucketized or embedded with small feature encoders.
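A rough sketch of these attribute encodings follows. The genre embedding table holds fixed toy values here (in practice it would be learned), and the feature names, bucket boundaries and the cap of 1,000,000 reviews are all assumptions made for the example.

```python
import bisect
import math


def bucketize(value, boundaries):
    """Index of the bucket containing `value`, given sorted boundaries."""
    return bisect.bisect_right(boundaries, value)


def log_scale(value, max_value):
    """Compress a heavy-tailed non-negative number into [0, 1]."""
    return math.log1p(value) / math.log1p(max_value)


# Hypothetical dense embedding table; real vectors would be learned.
GENRE_EMBEDDINGS = {"thriller": [0.9, 0.1], "comedy": [0.1, 0.8]}
UNKNOWN_GENRE = [0.0, 0.0]  # fallback for unseen categories


def encode_attributes(genre, runtime_minutes, review_count):
    genre_vec = GENRE_EMBEDDINGS.get(genre, UNKNOWN_GENRE)
    # Four runtime buckets: <60, 60-89, 90-119, >=120 minutes.
    bucket = bucketize(runtime_minutes, [60, 90, 120])
    runtime_vec = [1.0 if i == bucket else 0.0 for i in range(4)]
    return genre_vec + runtime_vec + [log_scale(review_count, 1_000_000)]


features = encode_attributes("thriller", 95, 0)
```

The categorical field stays a compact 2-dimensional vector regardless of vocabulary size, while the tiny 4-bucket one-hot for runtime is cheap enough to keep as is.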
Fuse modalities by concatenation plus projection, weighted averaging in a shared space, or a small MLP/Transformer over modality vectors. Normalize final embeddings for cosine search. For 1M items and large traffic, precompute item embeddings and build the ANN index offline or nearline; do not run heavy encoders per online request unless the product really requires it.
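The simplest of these fusion options, weighted averaging in a shared space, can be sketched as follows. It assumes every modality has already been projected to the same dimension, and it skips missing modalities rather than filling them with zeros. The modality names and weights are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(modalities, weights):
    """Weighted average of unit-length modality vectors, then L2-normalize.

    modalities: name -> vector, or None when the modality is missing.
    weights:    name -> importance weight (illustrative, not tuned).
    """
    present = {name: vec for name, vec in modalities.items() if vec is not None}
    if not present:
        raise ValueError("item has no usable modality")
    dim = len(next(iter(present.values())))
    total = [0.0] * dim
    for name, vec in present.items():
        unit = l2_normalize(vec)  # keeps one modality's scale from dominating
        total = [t + weights[name] * u for t, u in zip(total, unit)]
    # Final normalization makes the result ready for cosine search.
    return l2_normalize(total)


weights = {"text": 0.5, "image": 0.3, "attributes": 0.2}
fused = fuse({"text": [1.0, 0.0], "image": [0.0, 2.0], "attributes": None}, weights)
text_only = fuse({"text": [3.0, 4.0], "image": None, "attributes": None}, weights)
```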
Common mistakes
- Feeding every video frame through a large model online.
- Concatenating arbitrary embeddings without normalization or projection.
- Ignoring missing modalities.
How to say it in an interview
- Name concrete model families but keep them sized for serving.
- Say how to handle text, image/video, categorical and numeric features separately.
ML System Design
Which offline and online metrics would you use for a similar-items recommender, and what pitfalls are easy to miss?
Short answer
Use Recall@K, Precision@K, NDCG, coverage/diversity and latency offline, but evaluate against meaningful candidate sets with hard negatives. Online, use user-level A/B metrics such as CTR, watch time, conversion, retention and guardrails.
Detailed walkthrough
Offline retrieval metrics include Recall@K and HitRate@K against labeled positives. Ranking metrics include Precision@K, NDCG@K and MRR when positions matter. Add coverage, novelty/diversity, catalog freshness, cold-start slice metrics, latency and index freshness.
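For reference, minimal binary-relevance versions of three of these metrics might look like this. The function names are illustrative, and graded relevance would need a gain term in the DCG sum.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (0 if none is found)."""
    for pos, item in enumerate(ranked):
        if item in relevant:
            return 1.0 / (pos + 1)
    return 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
first_hit = mrr(ranked, relevant)
```

With this toy ranking, only `b` is found in the top 3, so Recall@3 is 1/3 and the reciprocal rank is 1/2.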
The big pitfall is evaluation against easy random negatives. If the candidate set contains one positive and thousands of random unrelated items, almost any reasonable model can look excellent. Use hard-negative candidate pools, judged pairs and realistic retrieval/reranking sets.
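The toy example below illustrates the pitfall with a deliberately weak model that scores only by category match. The catalog, item names and intents are invented for the demonstration.

```python
# Hypothetical toy catalog: item -> (category, intent).
CATALOG = {
    "anchor": ("thriller", "heist"),
    "pos": ("thriller", "heist"),
    "easy1": ("cartoon", "kids"),
    "easy2": ("cooking", "howto"),
    "hard1": ("thriller", "courtroom"),
    "hard2": ("thriller", "war"),
}


def category_only_score(a, b):
    """A weak model: 1.0 if the categories match, else 0.0."""
    return 1.0 if CATALOG[a][0] == CATALOG[b][0] else 0.0


def rank_of_positive(score_fn, anchor, positive, negatives):
    """1-based rank of the positive; ties are counted against the model."""
    pos_score = score_fn(anchor, positive)
    return 1 + sum(1 for n in negatives if score_fn(anchor, n) >= pos_score)


easy_rank = rank_of_positive(category_only_score, "anchor", "pos", ["easy1", "easy2"])
hard_rank = rank_of_positive(category_only_score, "anchor", "pos", ["hard1", "hard2"])
```

Against random-looking negatives the weak model ranks the positive first; against same-category negatives it cannot separate the positive at all and the rank drops to last.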
Online metrics should reflect product value: CTR on similar items, watch starts, watch time, purchases/subscriptions where relevant, add-to-list, downstream retention and revenue. In most consumer scenarios, split A/B tests by user. Define guardrails for bad recommendations, latency and content policy, and compute the minimum detectable effect (MDE) and statistical power before launching.
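A back-of-the-envelope power calculation for a binary per-user metric (for example, "clicked a similar item at least once") can be sketched with the standard two-proportion approximation. The 5% baseline and 5% relative lift below are made-up inputs; continuous metrics such as watch time need a variance-based formula instead.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


n_small_lift = users_per_arm(baseline_rate=0.05, relative_mde=0.05)
n_large_lift = users_per_arm(baseline_rate=0.05, relative_mde=0.10)
```

Under these assumptions, detecting a 5% relative lift takes on the order of 10^5 users per arm, and halving the detectable effect roughly quadruples the required sample.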
Common mistakes
- Reporting only Recall@K on random negatives.
- Ignoring cold-start and catalog-coverage slices.
- Splitting the A/B test by item when user-level interference is the main concern.
How to say it in an interview
- Mention NDCG and then immediately discuss the candidate-set pitfall.
- Separate offline ranking quality from online business impact.
ML System Design
If you train on feedback from the previous recommender, what biases can appear and how can you reduce them?
Short answer
The logged feedback is biased by what the old model chose to expose, by item popularity and by display position. Mitigate with exploration traffic, exposure-aware labels, propensity weighting, randomized buckets and slice monitoring.
Detailed walkthrough
A recommender only observes feedback for items it exposed. That creates selection bias. Popular items get more exposure, high positions get more clicks, and items the old model never showed may look bad only because nobody saw them.
Mitigations start with logging. Store impressions, positions, candidate source, user/item context and whether the item was actually visible. Treat "no click" differently when there was no exposure.
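A minimal sketch of exposure-aware labeling, assuming a log schema with hypothetical `visible` and `clicked` fields: rows that were never visible get no label at all instead of a negative.

```python
def label_impressions(log_rows):
    """Attach a training label to each logged row.

    1    = visible and clicked
    0    = visible but not clicked (a real negative signal)
    None = never visible, so the row carries no evidence either way
    """
    labeled = []
    for row in log_rows:
        if not row["visible"]:
            label = None
        elif row["clicked"]:
            label = 1
        else:
            label = 0
        labeled.append({**row, "label": label})
    return labeled


log_rows = [
    {"anchor_id": "a", "candidate_id": "x", "position": 1, "visible": True, "clicked": True},
    {"anchor_id": "a", "candidate_id": "y", "position": 2, "visible": True, "clicked": False},
    {"anchor_id": "a", "candidate_id": "z", "position": 9, "visible": False, "clicked": False},
]
labeled = label_impressions(log_rows)
training_rows = [row for row in labeled if row["label"] is not None]
```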
Add controlled exploration: small random or diversified traffic, epsilon-greedy buckets, interleaving, or candidate-source mixing. Use propensity weighting or debiasing methods when estimating offline value from logged data. Monitor popularity, novelty, category and cold-start slices so the model does not collapse into safe popular recommendations.
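As one illustration of propensity weighting, the sketch below corrects click rates for position bias with inverse propensity scoring. It assumes the probability that a user examines each position is already known (for example, estimated from randomized traffic); the propensity values and log rows are invented.

```python
def naive_ctr(impressions):
    return sum(row["clicked"] for row in impressions) / len(impressions)


def ips_ctr(impressions, position_propensity):
    """Position-debiased click rate: each click is weighted by
    1 / P(user examines that position)."""
    weighted_clicks = sum(
        row["clicked"] / position_propensity[row["position"]]
        for row in impressions
    )
    return weighted_clicks / len(impressions)


# Assumed examination probabilities per display position.
propensity = {1: 1.0, 5: 0.2}

# Item A was always shown at the top; item B was always buried at position 5.
item_a = [{"position": 1, "clicked": 1}] * 2 + [{"position": 1, "clicked": 0}] * 8
item_b = [{"position": 5, "clicked": 1}] * 1 + [{"position": 5, "clicked": 0}] * 9

naive = (naive_ctr(item_a), naive_ctr(item_b))
debiased = (ips_ctr(item_a, propensity), ips_ctr(item_b, propensity))
```

On raw clicks item A looks twice as good as item B, but after weighting by position the ordering flips. In practice the weights are usually clipped to limit variance.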
For item-to-item recommendations, also separate global item similarity from personalized feedback. If the product wants the same similar items for all users, aggregate feedback at the level of (anchor item, candidate item) pairs; if it wants personalization, introduce user context explicitly.
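For the non-personalized case, pair-level aggregation could be as simple as the sketch below, which deliberately drops the user dimension. The event field names are hypothetical.

```python
from collections import defaultdict


def aggregate_pair_feedback(events):
    """Sum impressions and clicks per (anchor, candidate) pair across all users."""
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for event in events:
        key = (event["anchor_id"], event["candidate_id"])
        stats[key]["impressions"] += 1
        stats[key]["clicks"] += int(event["clicked"])
    return dict(stats)


events = [
    {"user_id": "u1", "anchor_id": "a", "candidate_id": "x", "clicked": True},
    {"user_id": "u2", "anchor_id": "a", "candidate_id": "x", "clicked": False},
    {"user_id": "u1", "anchor_id": "a", "candidate_id": "y", "clicked": False},
]
pair_stats = aggregate_pair_feedback(events)
```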
Common mistakes
- Treating every unclicked item as a true negative without exposure context.
- Ignoring position bias.
- Adding exploration without measuring user-impact guardrails.
How to say it in an interview
- Say "previous model exposure bias" directly.
- Bring up impression logging before fancy debiasing methods.