ML System Design
If the old product used filters rather than free-form text, how would you train a query parser or query encoder before real text-query logs exist?
Short answer
Bootstrap from filter sessions by generating natural-language queries for filter combinations, add human/LLM validation, train the parser, then replace synthetic assumptions with real query logs after launch.
Full breakdown
When historical text queries do not exist, the strongest source is existing filter behavior. A session with city, price, rooms, dates, amenities and clicked listings can be converted into one or several plausible text queries.
Use LLMs to generate diverse paraphrases from filter JSON, but keep the labels from the original structured filters. Add constraints so generated text does not invent attributes. Sample realistic partial filters, not only fully specified listings. Validate a subset with humans and build adversarial cases for ambiguous geo, slang, typos and long-tail amenities.
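The generation step can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the filter field names, `TEMPLATES`, `VOCAB` and all function names are hypothetical, and a simple template renderer stands in for the LLM paraphraser so the example runs on its own. The point is the shape of the loop: sample a partial filter set, generate text from it, keep the filters as labels, and reject text that mentions a known value not present in the labels.

```python
import random

# Hypothetical filter schema and phrasing; all names here are assumptions.
TEMPLATES = {
    "city": "in {}",
    "max_price": "under {} per night",
    "rooms": "{} rooms",
    "amenities": "with {}",
}
# Hypothetical vocabulary of known attribute values used by the checker.
VOCAB = {
    "city": ["Lisbon", "Porto"],
    "amenities": ["wifi", "parking", "pool"],
}

def sample_partial_filters(filters, rng, keep_prob=0.6):
    """Keep each filter with probability keep_prob, never returning an empty set."""
    kept = {k: v for k, v in filters.items() if rng.random() < keep_prob}
    if not kept:
        key = rng.choice(sorted(filters))
        kept = {key: filters[key]}
    return kept

def render_query(filters, rng):
    """Template stand-in for an LLM: the text is built only from the given filters."""
    parts = []
    for key, value in filters.items():
        if isinstance(value, list):
            value = " and ".join(value)
        parts.append(TEMPLATES[key].format(value))
    rng.shuffle(parts)
    return "apartment " + " ".join(parts)

def invented_values(text, labels, vocab):
    """Return (field, value) pairs that appear in the text but not in the labels."""
    lowered = text.lower()
    found = []
    for field, values in vocab.items():
        allowed = labels.get(field, [])
        if not isinstance(allowed, list):
            allowed = [allowed]
        allowed = {str(v).lower() for v in allowed}
        for value in values:
            # Naive substring match; a real checker would need tokenization.
            if value.lower() in lowered and value.lower() not in allowed:
                found.append((field, value))
    return found

rng = random.Random(0)
session = {"city": "Lisbon", "max_price": 120, "rooms": 2,
           "amenities": ["wifi", "parking"]}
labels = sample_partial_filters(session, rng)  # labels come from the filters
query = render_query(labels, rng)              # text comes from the generator
print(query, labels, invented_values(query, labels, VOCAB))
```

With a real LLM in place of `render_query`, the same `invented_values` check would act as a cheap automatic filter before human validation; it would not catch paraphrased or misspelled attributes, which is one reason a human-reviewed subset is still needed.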
A first model can combine a multi-label classifier or sequence tagger for attributes with a geo resolver and a text encoder. After launch, log real text queries, clicked/listed items and user corrections. Gradually reweight training toward real data, but keep synthetic data for coverage and regression tests.
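One way to picture the gradual reweighting is a batch-mixing schedule. The sketch below is an assumption, not a prescribed recipe: the hyperbolic schedule, the `half_point` of 10,000 real examples, the 10% synthetic floor and the function names are all illustrative choices.

```python
import random

def real_fraction(n_real, half_point=10_000):
    """Share of a batch drawn from real logs; equals 0.5 at half_point examples."""
    return n_real / (n_real + half_point)

def mix_batch(real, synthetic, batch_size, rng, min_synthetic=0.1):
    """Sample a batch, capping the real share so synthetic coverage never vanishes."""
    share = min(real_fraction(len(real)), 1.0 - min_synthetic)
    n_real = round(batch_size * share)
    # Sampling with replacement keeps the sketch short; n_real is 0 when real is empty.
    batch = rng.choices(real, k=n_real) if n_real else []
    batch += rng.choices(synthetic, k=batch_size - n_real)
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
print(mix_batch([], ["synthetic"], 4, rng))                    # before launch
print(mix_batch(["real"] * 10_000, ["synthetic"], 4, rng))     # after some traffic
```

Before launch every batch is synthetic; as real logs accumulate, their share grows toward the cap. The floor reflects the idea of keeping synthetic data for coverage of rare filter combinations, while a fixed synthetic set can separately serve as a regression suite.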
Theory
Synthetic queries are a bridge, not the final distribution; the production loop must collect real queries quickly.
Common mistakes
- Generating polished synthetic queries that users would never type.
- Letting the LLM invent labels that are not present in the source filters.
- Never replacing synthetic data with real user logs.
How to answer in an interview
- Start from filter sessions.
- Mention human validation and post-launch real-query collection.