ML System Design
You have a categorical feature such as port_id. Compare one-hot encoding with historical target aggregates for tree models, and explain the leakage risks.
Short answer
One-hot encoding cannot leak the label, but it becomes sparse at high cardinality. Historical target aggregates can be much stronger, but they must be computed from past or out-of-fold data only; otherwise they leak label information.
Full explanation
One-hot encoding port_id gives the model one binary indicator column per port, so a tree can isolate a port with a single split. This works for tree models, but with high cardinality each column is mostly zeros, each split separates only a few rows, and the encoding says nothing about port similarity or how congested a port has historically been. It is still a safe baseline because the encoding itself does not use the label.
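As a rough illustration of the one-hot baseline, the sketch below fixes the port vocabulary at training time and maps any unseen port to an all-zero row. The helper name `one_hot_ports` and the toy port values are assumptions made for this example, not part of the original question.

```python
import pandas as pd

train_ports = pd.Series(["A", "B", "A"])
known_ports = sorted(train_ports.unique())  # vocabulary fixed at training time

def one_hot_ports(ports, categories):
    """Hypothetical helper: one indicator column per known port."""
    # Values outside `categories` become NaN, which get_dummies turns
    # into an all-zero row (dummy_na defaults to False).
    cat = pd.Categorical(ports, categories=categories)
    return pd.get_dummies(pd.Series(cat), prefix="port").astype(int)

# "C" was never seen in training, so its row has no active column.
encoded = one_hot_ports(["B", "C"], known_ports)
print(encoded)
```

Nothing in this encoding reads the target column, which is why it is safe from label leakage.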
Historical aggregates such as mean waiting time per port, recent queue length, rolling median delay, or per-port seasonality are often stronger features. The danger is leakage: if you compute the aggregate over the whole dataset, each row’s own target and the targets of future rows feed into its feature. Offline validation metrics will then look better than production performance.
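The leak is easiest to see on a rare port. In the minimal sketch below, the column name `wait_hours` and the toy values are assumptions; the point is only that a full-dataset group mean includes the row's own label.

```python
import pandas as pd

df = pd.DataFrame({
    "port_id": ["A", "A", "A", "B"],      # port B appears only once
    "wait_hours": [10.0, 30.0, 50.0, 7.0],
})

# Leaky: the per-port mean is taken over every row, including the row itself.
df["leaky_mean"] = df.groupby("port_id")["wait_hours"].transform("mean")

# For the single-row port, the "feature" is exactly the label.
print(df[df["port_id"] == "B"])
```

For ports with more rows the effect is diluted but still present, because each row contributes to its own feature value.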
Use time-aware aggregation or out-of-fold target encoding. For a prediction at time t, compute the feature only from records observed before t. For cross-validation, split folds in time order, or fit the encoding inside each training fold and apply it to the held-out fold. Add smoothing toward a global prior for rare ports so the feature does not overfit low-count categories.
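One possible way to combine the past-only rule with smoothing is sketched below. The function name `past_only_target_mean`, the column names `ts` and `wait_hours`, the prior value, and the smoothing weight `m` are all assumptions for illustration. The sketch also assumes timestamps are distinct and that a row's target is already known by the time any later row is scored; in a real system, delayed labels would need an extra cutoff.

```python
import pandas as pd

def past_only_target_mean(df, prior, m=5.0,
                          key="port_id", target="wait_hours", time="ts"):
    """Smoothed per-port mean of `target` using strictly earlier rows only.

    `prior` is a fallback value (for example, a global mean taken from an
    earlier period); `m` is how many pseudo-observations it is worth.
    """
    out = df.sort_values(time, kind="stable").copy()
    grp = out.groupby(key)[target]
    # Count and sum of earlier rows for the same port: the running
    # totals minus the current row's own contribution.
    prev_n = grp.cumcount()
    prev_sum = grp.cumsum() - out[target]
    # Rare or unseen ports (prev_n == 0) fall back to the prior.
    out["port_te"] = (prev_sum + m * prior) / (prev_n + m)
    return out

events = pd.DataFrame({
    "ts": [1, 2, 3, 4],
    "port_id": ["A", "B", "A", "A"],
    "wait_hours": [10.0, 20.0, 30.0, 50.0],
})
encoded_events = past_only_target_mean(events, prior=20.0, m=1.0)
print(encoded_events[["ts", "port_id", "port_te"]])
```

The first row for each port receives exactly the prior, so a port that has never been seen needs no special case. Larger values of `m` pull low-count ports harder toward the prior.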
Common mistakes
- Computing the target mean per category on the full dataset.
- Assuming tree models always ignore one-hot columns.
- Forgetting rare-category smoothing and unknown-port handling.