ML System Design

You have a categorical feature such as port_id. Compare one-hot encoding with historical target aggregates for tree models, and explain the leakage risks.

Ответить самому

Сначала сформулируйте ответ как на собеседовании, затем откройте разбор и оцените себя.

Загрузка

One-hot encoding is leakage-safe but can be sparse. Target/statistical aggregates can be powerful, but they must be computed using only past or out-of-fold data; otherwise they leak label information.

Полный разбор

One-hot encoding port_id gives the model a binary split per port. For tree models it can work, but high cardinality makes splits sparse and may not capture port similarity or historical congestion strength well. It is still a safe baseline because the encoding itself does not use the label. Historical aggregates such as mean waiting time per port, recent queue length, rolling median delay or per-port seasonality are often stronger. The danger is leakage: if you compute the aggregate over the whole dataset, the row’s own target and future rows influence the feature. Offline validation will look better than real production. Use time-aware aggregation or out-of-fold target encoding. For a prediction at time t, the feature must be computed from records before t. For cross-validation, build folds in time order or calculate encodings inside each training fold only. Add smoothing for rare ports so the feature does not overfit low-count categories.