Data splitting and leakage in a fraud model
Short answer
Prefer a time-based split that mimics deployment. Check leakage from future aggregates, labels or actions created after the prediction time, duplicate users/entities across splits, and identifiers that proxy the target too directly.
Full breakdown
For fraud and other time-dependent systems, a random split often overestimates quality. The model will be used on future events, so validation should usually train on earlier time periods and validate on later periods. If the product has repeated users, merchants or devices, consider grouped or entity-aware splits as an additional stress test.
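A minimal sketch of the time-based split and the entity-aware stress test described above, in plain Python. The field names (`ts`, `user_id`, `is_fraud`), the helper names, and the toy events are hypothetical and only illustrate the idea; a real pipeline would likely use its own schema and tooling.

```python
from datetime import datetime

def temporal_split(events, cutoff):
    """Train on events strictly before `cutoff`, validate on the rest."""
    train = [e for e in events if e["ts"] < cutoff]
    valid = [e for e in events if e["ts"] >= cutoff]
    return train, valid

def unseen_entity_subset(train, valid, key="user_id"):
    """Stress test: keep only validation events whose entity never appears in train."""
    seen = {e[key] for e in train}
    return [e for e in valid if e[key] not in seen]

# Toy events; all values are made up for illustration.
events = [
    {"ts": datetime(2024, 1, 5), "user_id": "u1", "is_fraud": 0},
    {"ts": datetime(2024, 1, 20), "user_id": "u2", "is_fraud": 1},
    {"ts": datetime(2024, 2, 3), "user_id": "u1", "is_fraud": 0},
    {"ts": datetime(2024, 2, 10), "user_id": "u3", "is_fraud": 1},
]

train, valid = temporal_split(events, cutoff=datetime(2024, 2, 1))
cold_start = unseen_entity_subset(train, valid)
```

Comparing metrics on `valid` with metrics on `cold_start` gives a rough sense of how much the model relies on having already seen an entity, which is one way to read the "additional stress test" above.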
Leakage checks should follow the prediction timestamp. Any feature must be available at decision time. Common leaks include aggregates computed over the full dataset, chargeback labels or moderation actions that happen after the event, future user behavior, target-derived flags, and entity IDs that memorize repeat offenders across random splits.
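One way to keep an aggregate feature tied to the prediction timestamp is to build it only from events that came earlier and from labels that were already known by then. The sketch below assumes hypothetical fields `event_id`, `user_id`, `ts`, `is_fraud`, and `label_ts` (the time the chargeback label became known); it is an illustration, not a reference implementation.

```python
from collections import defaultdict
from datetime import datetime

def prior_known_fraud_counts(events):
    """For each event, count the user's earlier events whose fraud label
    was already known (label_ts <= event ts) at prediction time."""
    history_by_user = defaultdict(list)
    features = []
    for e in sorted(events, key=lambda ev: ev["ts"]):
        history = history_by_user[e["user_id"]]
        known_fraud = sum(
            1 for h in history if h["is_fraud"] and h["label_ts"] <= e["ts"]
        )
        features.append(
            {"event_id": e["event_id"], "prior_known_fraud": known_fraud}
        )
        # The current event joins the history only after its own feature is built.
        history.append(e)
    return features

# Toy data: the chargeback for e1 arrives on Jan 20, after e2 but before e3.
events = [
    {"event_id": "e1", "user_id": "u1", "ts": datetime(2024, 1, 1),
     "is_fraud": 1, "label_ts": datetime(2024, 1, 20)},
    {"event_id": "e2", "user_id": "u1", "ts": datetime(2024, 1, 10),
     "is_fraud": 0, "label_ts": datetime(2024, 2, 10)},
    {"event_id": "e3", "user_id": "u1", "ts": datetime(2024, 1, 25),
     "is_fraud": 0, "label_ts": datetime(2024, 2, 25)},
]

features = prior_known_fraud_counts(events)
```

A count computed over the full dataset would give every event of `u1` the value 1, including `e1` itself and `e2`, where the chargeback was not yet known. That is the kind of leak the paragraph above describes.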
A good validation report includes temporal holdout performance, calibration/threshold behavior, segment metrics and drift monitoring. If fraud patterns change quickly, keep a recent validation window and backtest across several time periods.
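Backtesting across several time periods can be sketched as an expanding-window loop: train on everything before a cutoff, validate on the next window, then move the cutoff forward. The function name, the `ts` and `is_fraud` fields, and the 10-day window are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def rolling_backtest_folds(events, first_cutoff, window, n_folds):
    """Yield (cutoff, train, valid): train on everything before the cutoff,
    validate on the next `window` of time, then move the cutoff forward."""
    for i in range(n_folds):
        cutoff = first_cutoff + i * window
        train = [e for e in events if e["ts"] < cutoff]
        valid = [e for e in events if cutoff <= e["ts"] < cutoff + window]
        yield cutoff, train, valid

# Toy data: one event per day for 40 days; labels are made up.
start = datetime(2024, 1, 1)
events = [
    {"ts": start + timedelta(days=d), "is_fraud": int(d % 10 == 0)}
    for d in range(40)
]

fold_sizes = [
    (len(train), len(valid))
    for _, train, valid in rolling_backtest_folds(
        events,
        first_cutoff=start + timedelta(days=10),
        window=timedelta(days=10),
        n_folds=3,
    )
]
```

In a real report each fold would be scored with the same metrics (for example precision at the operating threshold and calibration), so that a drop in the most recent fold stands out as possible drift.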
Theory
A validation split is a simulation of production. If it gives the model future information, the offline score is not trustworthy.
Common mistakes
- Using a random split for a temporal fraud problem without checking for leakage.
- Building aggregate features before splitting, so that validation rows are described by statistics computed from future or held-out data.
- Letting the same fraudulent entity appear in both train and validation in a way that cannot happen at launch.
How to answer in an interview
- Anchor every feature to "available at prediction time".
- Give concrete leakage examples instead of only saying the word "leakage".