ML System Design
For a port waiting-time model, what features would you build beyond timestamp features, and how would you detect anomalies or broken tracking data?
Short answer
Use port, ship and queue-state features plus historical congestion aggregates computed without leakage. Data-quality checks should catch impossible timestamps, inconsistent event order, extreme waits and distribution shifts.
Full breakdown
Useful features include port identity, berth/terminal type, ship class, cargo type if available, planned arrival slot, day-of-week, seasonality, weather, recent queue length, recent average service time, number of ships currently waiting and historical congestion for the same port and time bucket.
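The queue-state and congestion features above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the record layout (`port`, `arrival`, `berth`), the function name `queue_features` and the 7-day window are all assumptions made for the example. The point it tries to show is that a waiting time only becomes known when the ship berths, so the trailing average may use only waits that finished before the prediction time.

```python
from datetime import datetime, timedelta
from statistics import mean


def queue_features(calls, port, at, window=timedelta(days=7)):
    """Features for a ship arriving at `port` at time `at`.

    Only information observable at `at` is used (hypothetical record layout:
    dicts with "port", "arrival" and "berth", where "berth" may be None).
    """
    same_port = [c for c in calls if c["port"] == port]

    # Ships that arrived earlier and had not berthed yet: the visible queue.
    ships_waiting = sum(
        1
        for c in same_port
        if c["arrival"] < at and (c["berth"] is None or c["berth"] > at)
    )

    # Waits that were fully observed before `at` and ended inside the window.
    finished_waits_h = [
        (c["berth"] - c["arrival"]).total_seconds() / 3600
        for c in same_port
        if c["berth"] is not None and at - window <= c["berth"] < at
    ]
    recent_avg = mean(finished_waits_h) if finished_waits_h else None
    return {"ships_waiting": ships_waiting, "recent_avg_wait_h": recent_avg}


calls = [
    {"port": "A", "arrival": datetime(2024, 1, 1, 0), "berth": datetime(2024, 1, 1, 10)},
    {"port": "A", "arrival": datetime(2024, 1, 2, 0), "berth": datetime(2024, 1, 2, 6)},
    # Still in the queue at prediction time, so its 24 h wait must not be used.
    {"port": "A", "arrival": datetime(2024, 1, 3, 0), "berth": datetime(2024, 1, 4, 0)},
    {"port": "B", "arrival": datetime(2024, 1, 3, 0), "berth": datetime(2024, 1, 3, 1)},
]
features = queue_features(calls, "A", at=datetime(2024, 1, 3, 12))
```

Filtering on `berth < at` rather than `arrival < at` is the design choice that avoids leakage: a ship that arrived earlier but is still waiting contributes to the queue length, not to the average wait.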
For anomaly detection, first add rule-based checks: negative waiting time, impossible event order, duplicated events, huge jumps, missing departure/arrival events and inconsistent timezone handling. Then add statistical checks over target and feature distributions: robust z-scores, percentile caps, isolation-forest-style outlier detection or per-port control charts.
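A small sketch of the two layers, under stated assumptions: the record fields (`arrival`, `berth`), the flag names and the helper names `rule_flags` and `robust_z` are invented for the example. The statistical layer uses the median/MAD modified z-score with the commonly used 3.5 cutoff; in practice the score would likely be computed per port rather than over all ports together.

```python
from datetime import datetime, timezone
from statistics import median


def rule_flags(call):
    """Return the names of the hard rules violated by one port call."""
    arrival, berth = call.get("arrival"), call.get("berth")
    if arrival is None or berth is None:
        return ["missing_event"]
    flags = []
    if arrival.tzinfo is None or berth.tzinfo is None:
        # Timezone unknown: do not compare, the ordering would be unreliable.
        flags.append("naive_timestamp")
    elif berth < arrival:
        flags.append("negative_wait")
    return flags


def robust_z(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical; the score is undefined,
        # so this sketch reports no outliers instead of dividing by zero.
        return [0.0] * len(values)
    return [0.6745 * (v - med) / mad for v in values]


# Waiting times in hours for one port (made-up numbers).
waits_h = [5, 6, 7, 6, 5, 6, 200]
outliers = [i for i, z in enumerate(robust_z(waits_h)) if abs(z) > 3.5]
```

Median and MAD are used instead of mean and standard deviation because a single 200-hour wait would inflate the standard deviation enough to hide itself.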
In a production system, anomaly handling should be explicit. Some anomalies are data errors and should be fixed or removed; some are rare but real disruptions and should be modeled or flagged. Keep an audit table so that the model does not silently learn from corrupted event tracking.
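One way to make that handling explicit is sketched below. Everything named here is an assumption for illustration: the flag names, the `HARD_ERRORS` set, the `anomaly_audit` table and its columns, and the three actions. Records with hard data errors are excluded from training, statistical outliers are kept but flagged, and every decision is written to an audit table (an in-memory SQLite database stands in for a real store).

```python
import sqlite3

# Flags treated as tracking errors rather than real disruptions (assumed names).
HARD_ERRORS = {"missing_event", "negative_wait", "out_of_order", "duplicate"}


def triage(flags):
    """Decide what to do with a record given its anomaly flags."""
    if HARD_ERRORS & set(flags):
        return "exclude"        # corrupted tracking, do not train on it
    if flags:
        return "keep_flagged"   # rare but plausible, keep and mark it
    return "keep"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anomaly_audit (call_id TEXT, flags TEXT, action TEXT)")

# Output of earlier checks: call id -> list of flags (made-up data).
checked = {"c1": [], "c2": ["negative_wait"], "c3": ["extreme_wait"]}

training_ids = []
for call_id, flags in checked.items():
    action = triage(flags)
    if flags:
        conn.execute(
            "INSERT INTO anomaly_audit (call_id, flags, action) VALUES (?, ?, ?)",
            (call_id, ",".join(sorted(flags)), action),
        )
    if action != "exclude":
        training_ids.append(call_id)
conn.commit()
```

Keeping the flagged record in the training set, with the flag available as a feature or a filter, leaves the choice between modeling and ignoring disruptions open instead of silently deleting them.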
Common mistakes
- Using only timestamp features and ignoring port and ship context.
- Treating every outlier as an error and deleting rare but real disruptions.
- Computing historical aggregates with data from the future (leakage).