ML System Design
How would you build item embeddings from text, images/video and categorical/numerical attributes under real serving constraints?
Short answer
Use lightweight modality encoders, normalize/project each modality, fuse into one item vector and precompute offline when possible. CLIP-like models can align image/text; attributes need embeddings or compact projections.
Full breakdown
A practical design uses separate encoders per modality. Text can use a small BERT/Sentence-BERT-like encoder. Images or key video frames can use EfficientNet, ViT or CLIP visual towers. Video may be represented by sampled frames, scene-level embeddings, audio/subtitle signals or a heavier video model if latency and cost allow.
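The "sampled frames" option can be sketched as follows. This is a minimal illustration, assuming uniform sampling and mean pooling; the helper names `sample_frame_indices` and `mean_pool` are hypothetical, and the per-frame embeddings would come from whichever image encoder is chosen.

```python
from typing import List, Sequence


def sample_frame_indices(num_frames: int, k: int) -> List[int]:
    """Pick up to k evenly spaced frame indices (the centre of each segment)."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame embeddings into a single video-level vector."""
    if not vectors:
        raise ValueError("need at least one frame embedding")
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


# Usage: choose 4 frames from a 100-frame clip, then pool their embeddings.
indices = sample_frame_indices(100, 4)
video_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # toy 2-d frame embeddings
```

Encoding a fixed number of frames per video keeps the offline cost per item bounded regardless of clip length.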
Categorical attributes should not become huge sparse one-hot vectors at serving time. Map them to learned embeddings or compact projections. Numerical features can be normalized, bucketized or embedded with small feature encoders.
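One way to keep attribute inputs compact is sketched below, assuming a hashed embedding table for categorical values and bucketing or log scaling for numbers. The table size `NUM_BUCKETS`, the bucket boundaries and the function names are illustrative assumptions, not part of the original answer; the learned embedding table that the row index points into is left out.

```python
import bisect
import hashlib
import math
from typing import Sequence

NUM_BUCKETS = 1000  # hypothetical size of the hashed embedding table


def hashed_row(feature: str, value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a categorical value to a stable row index of an embedding table."""
    digest = hashlib.md5(f"{feature}={value}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def bucketize(x: float, boundaries: Sequence[float]) -> int:
    """Return the bucket id of x given sorted bucket boundaries."""
    return bisect.bisect_right(boundaries, x)


def log_zscore(x: float, mean: float, std: float) -> float:
    """Log-scale a non-negative feature, then standardize it."""
    return (math.log1p(max(x, 0.0)) - mean) / max(std, 1e-6)


# Usage: a brand id becomes a table row, a price becomes a bucket id.
brand_row = hashed_row("brand", "acme")
price_bucket = bucketize(15.0, [10.0, 50.0, 100.0])
```

A stable hash (rather than Python's built-in `hash`, which is salted per process for strings) keeps row indices consistent between the training and serving jobs.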
Fuse modalities by concatenation plus projection, weighted averaging in a shared space, or a small MLP/Transformer over modality vectors. L2-normalize the final embeddings so that cosine search reduces to a dot product. For 1M items and large traffic, precompute item embeddings and build the ANN index offline or nearline; do not run heavy encoders per online request unless the product really requires it.
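The concatenation-plus-projection option might look like the sketch below. It assumes each modality vector is already computed; the random matrix only stands in for a learned projection, and all names and dimensions are hypothetical.

```python
import math
import random
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def project(v: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Linear projection: weights has out_dim rows of in_dim values each."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def fuse(modalities: Sequence[Sequence[float]],
         weights: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each modality, concatenate, project, normalize again."""
    concat = [x for m in modalities for x in l2_normalize(m)]
    return l2_normalize(project(concat, weights))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """For unit-length vectors the dot product equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
text_vec, image_vec, attr_vec = [0.2, 0.9, 0.1], [3.0, 4.0], [1.0]
in_dim, out_dim = 6, 4  # 3 + 2 + 1 inputs, 4-d item vector (toy sizes)
# Placeholder for a learned projection matrix (an assumption for this sketch).
W = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

item_vec = fuse([text_vec, image_vec, attr_vec], W)
```

Normalizing each modality before concatenation prevents an encoder with a large output scale from dominating the fused vector.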
Theory
The design trades representation quality against refresh frequency, index size, latency and hardware cost.
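A back-of-the-envelope calculation makes the index-size part of the trade-off concrete. The 1M item count comes from the scenario above; the 256-dimensional embedding and the int8 quantization option are assumptions chosen only for illustration, and the figures cover raw vectors without any ANN index overhead.

```python
def raw_vector_bytes(num_items: int, dim: int, bytes_per_value: int) -> int:
    """Memory for the raw vectors only, excluding ANN graph or list overhead."""
    return num_items * dim * bytes_per_value


NUM_ITEMS = 1_000_000  # from the scenario
DIM = 256              # assumed embedding size

float32_bytes = raw_vector_bytes(NUM_ITEMS, DIM, 4)
int8_bytes = raw_vector_bytes(NUM_ITEMS, DIM, 1)

print(f"float32: {float32_bytes / 1e9:.3f} GB")  # float32: 1.024 GB
print(f"int8:    {int8_bytes / 1e9:.3f} GB")     # int8:    0.256 GB
```

Under these assumptions the vectors fit in memory on a single machine, so the dominant costs are more likely encoder compute and refresh than storage.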
Common mistakes
- Feeding every video frame through a large model online.
- Concatenating arbitrary embeddings without normalization or projection.
- Ignoring missing modalities.
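One possible way to handle missing modalities, assuming all modality vectors already live in a shared space, is to average only the modalities that are present and renormalize their weights. The function name and the example weights are hypothetical.

```python
from typing import Dict, List, Optional


def masked_weighted_average(vectors: Dict[str, Optional[List[float]]],
                            weights: Dict[str, float]) -> List[float]:
    """Weighted average over present modalities; weights are renormalized."""
    present = {name: v for name, v in vectors.items() if v is not None}
    if not present:
        raise ValueError("item has no modalities at all")
    total = sum(weights[name] for name in present)
    dim = len(next(iter(present.values())))
    fused = [0.0] * dim
    for name, v in present.items():
        w = weights[name] / total
        for j, x in enumerate(v):
            fused[j] += w * x
    return fused


# Usage: the item has no image, so text and attributes share the full weight.
item = {"text": [1.0, 0.0], "image": None, "attrs": [0.0, 1.0]}
modality_weights = {"text": 0.5, "image": 0.3, "attrs": 0.2}  # assumed values
fused = masked_weighted_average(item, modality_weights)
```

Alternatives include a learned "missing" placeholder vector per modality or modality dropout during training; which works best depends on the data.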
How to answer in an interview
- Name concrete model families but keep them sized for serving.
- Say how to handle text, image/video, categorical and numeric features separately.