ML System Design
How would you train a two-tower or CLIP-like text-image recommender using user-post interactions?
Short answer
Encode the user/context and the post's text-image content into a shared embedding space. Treat engaged impressions as positives, contrast them against random, exposed-but-not-engaged and hard negatives, and optimize a contrastive (InfoNCE), triplet or sampled-softmax loss.
Full breakdown
A two-tower recommender has a user tower and an item tower. The item tower can combine text and image encoders into a CLIP-like representation; the user tower can aggregate history, profile/context features and recent interactions. At serving time, item embeddings are precomputed and indexed, and the user embedding computed for the request retrieves the nearest items by dot product or cosine similarity.
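The two towers can be illustrated with a minimal, untrained sketch. Everything here is an assumption for illustration: the 3-dimensional vectors stand in for real text and image encoder outputs, fusion is a plain average rather than learned projection layers, and the function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def item_tower(text_emb, image_emb):
    """Toy item tower: average text and image embeddings, then normalize.
    A real tower would apply learned projections before fusing."""
    return l2_normalize(mean_pool([text_emb, image_emb]))

def user_tower(history_item_embs):
    """Toy user tower: mean-pool embeddings of recently engaged items."""
    return l2_normalize(mean_pool(history_item_embs))

def score(user_emb, item_emb):
    """Dot product; equals cosine similarity because both inputs are unit length."""
    return sum(u * i for u, i in zip(user_emb, item_emb))

# Hypothetical encoder outputs for two posts (text embedding, image embedding).
posts = {
    "cat_post": item_tower([1.0, 0.0, 0.0], [0.8, 0.2, 0.0]),
    "car_post": item_tower([0.0, 1.0, 0.0], [0.0, 0.9, 0.1]),
}

# A user whose history contains only the cat post.
user = user_tower([posts["cat_post"]])
ranked = sorted(posts, key=lambda pid: score(user, posts[pid]), reverse=True)
```

Because both towers emit unit-length vectors, the same index can serve dot-product and cosine retrieval. In this toy setup the cat post ranks first for a user whose history is the cat post.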
Positive pairs should come from meaningful interactions such as clicks with dwell, likes, saves, comments or follows. Negatives can include random items, shown-but-not-engaged impressions and hard negatives from the same topic or close embedding neighborhood. Hard negatives are useful because they force the model to distinguish plausible alternatives.
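One way to turn an impression log into training examples with all three negative sources is sketched below. The log schema, the 5-second dwell threshold, the `topic_of` mapping and the "same topic, never shown" rule for hard negatives are assumptions made for illustration, not a prescribed pipeline.

```python
import random

STRONG_ACTIONS = {"like", "save", "comment", "follow"}

def is_positive(event):
    """Hypothetical rule: a strong action, or a click with at least 5 s of dwell."""
    if event["action"] in STRONG_ACTIONS:
        return True
    return event["action"] == "click" and event.get("dwell_sec", 0) >= 5

def build_examples(impressions, catalog, topic_of, rng, n_random=1):
    """Return (user, positive_item, negative_item, negative_source) tuples."""
    by_user = {}
    for ev in impressions:
        by_user.setdefault(ev["user"], []).append(ev)

    examples = []
    for user, events in by_user.items():
        positives = [e["item"] for e in events if is_positive(e)]
        seen = {e["item"] for e in events}
        # Exposed negatives: shown to the user but not meaningfully engaged.
        exposed = [e["item"] for e in events
                   if not is_positive(e) and e["item"] not in positives]
        unseen = [i for i in catalog if i not in seen]
        for pos in positives:
            for neg in exposed:
                examples.append((user, pos, neg, "exposed"))
            # Hard negatives: same topic as the positive, never shown.
            # Some of these may be false negatives the user would have liked.
            for neg in unseen:
                if topic_of[neg] == topic_of[pos]:
                    examples.append((user, pos, neg, "hard"))
            # Random negatives: uniform draw from items the user never saw.
            for neg in rng.sample(unseen, min(n_random, len(unseen))):
                examples.append((user, pos, neg, "random"))
    return examples

topic_of = {"p1": "cats", "p2": "cats", "p3": "cars", "p4": "cats", "p5": "food"}
catalog = sorted(topic_of)
impressions = [
    {"user": "u1", "item": "p1", "action": "click", "dwell_sec": 12},
    {"user": "u1", "item": "p2", "action": "none"},
    {"user": "u1", "item": "p3", "action": "click", "dwell_sec": 1},
]
examples = build_examples(impressions, catalog, topic_of, random.Random(0))
```

Keeping the negative source on each example makes it possible to tune the mix later, for instance by down-weighting hard negatives if they turn out to contain many false negatives.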
Loss choices include contrastive InfoNCE/sampled softmax, triplet loss, pairwise ranking losses or multi-task objectives that also predict engagement. If the same item encoder feeds both retrieval and ranking, make sure the training objective matches both uses or split the towers when objectives conflict.
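As a concrete instance of the contrastive option, here is a minimal in-batch InfoNCE sketch in plain Python. It assumes that row i of the user batch is paired with row i of the item batch and that every other item in the batch serves as a negative; the temperature of 0.1 is an arbitrary illustrative value.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(user_embs, item_embs, temperature=0.1):
    """Mean in-batch InfoNCE loss.
    user_embs[i] is paired with item_embs[i]; other items are negatives."""
    losses = []
    for i, u in enumerate(user_embs):
        logits = [dot(u, v) / temperature for v in item_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the paired item as the correct class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

users = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(users, [[1.0, 0.0], [0.0, 1.0]])   # positives match
shuffled = info_nce(users, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched
```

The loss is near zero when each user is closest to its own item and large when the pairing is wrong, which is the signal that pulls positives together and pushes in-batch negatives apart.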
Theory
Two-tower retrieval learns a geometry in which relevant items lie close to the user embedding, so candidates can be found efficiently with approximate nearest neighbor (ANN) search.
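To make the retrieval step concrete, the sketch below shows the exact top-k search that an ANN index approximates, plus a recall measure for comparing an approximate result against it. The function names and the tiny 2-dimensional index are hypothetical; a production system would replace the full scan with an ANN library.

```python
import heapq

def top_k_exact(user_emb, item_index, k):
    """Exact maximum-inner-product search by scanning every item.
    An ANN index aims to return roughly this list without the full scan."""
    scored = (
        (sum(u * v for u, v in zip(user_emb, emb)), item_id)
        for item_id, emb in item_index.items()
    )
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

item_index = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
result = top_k_exact([1.0, 0.1], item_index, k=2)
# Pretend an ANN index returned ["a", "c"] for the same query.
recall = recall_at_k(["a", "c"], result)
```

This works only because the score is a plain dot product of two independently computed vectors; a scorer that needs user-item cross-features could not be indexed this way.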
Common mistakes
- Train only on random negatives and get weak discrimination.
- Use CLIP pretraining but never adapt it to product interactions.
- Forget that retrieval is limited to dot product/cosine similarity, while the ranker may need richer cross-features.
- Ignore exposure and position bias in logged interactions.
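For the last point, one common mitigation is to weight each logged positive by the inverse of its estimated examination probability. The sketch below assumes a power-law position model, (1/position)^eta, and a clipping threshold of 10; both are illustrative assumptions that would need to be estimated or tuned on real logs.

```python
def propensity(position, eta=1.0):
    """Assumed examination probability for a 1-based feed position."""
    return (1.0 / position) ** eta

def ipw_weight(position, eta=1.0, max_weight=10.0):
    """Inverse-propensity weight for a positive, clipped to limit variance."""
    return min(1.0 / propensity(position, eta), max_weight)

# Loss weights for positives logged at feed positions 1, 2, 5 and 50.
weights = [ipw_weight(p) for p in (1, 2, 5, 50)]
```

Engagements far down the feed get larger weights because they were less likely to be seen at all, and the clip keeps a few deep-position events from dominating the loss.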
How to answer in an interview
- Name at least two negative sources.
- Explain why the embedding must support ANN retrieval.