ML System Design
Finding one bad counterparty is not enough. How would you compute the share of turnover that went to suspicious counterparties across heterogeneous bank statements?
Short answer
Extract transaction rows with amounts, direction and counterparties, and validate them against statement totals. Then compute the share of outgoing turnover paid to suspicious counterparties, attach an extraction confidence, and send statements below the confidence threshold to review.
Full breakdown
The business signal is not only “was there any blacklist hit”. A legitimate company may have a tiny incidental payment to a bad counterparty, while a suspicious firm may route a large share of turnover through bad counterparties. Therefore the system needs transaction-level amounts and direction.
Parse each statement into rows containing date, debit or credit amount, counterparty identifiers, payment purpose and an extraction confidence score. Bank formats differ: debit and credit columns may appear in different positions, summary totals may appear at the beginning or end, and PDF extraction can reorder columns. Known templates, layout features and token classification are safer than one global regex.
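A minimal sketch of the target row structure, assuming the parser has already decided which cell is debit and which is credit. The names `TransactionRow` and `row_from_columns` are hypothetical, and treating debit as outgoing from the account holder's side is an assumption that should be checked per bank template.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class TransactionRow:
    booked_on: date
    direction: str                   # "debit" (assumed outgoing) or "credit" (incoming)
    amount: Decimal                  # always positive; direction carries the sign
    counterparty_inn: Optional[str]  # None when the identifier could not be extracted
    purpose: str
    confidence: float                # extraction confidence in [0, 1]


def row_from_columns(booked_on, debit, credit, inn, purpose, confidence):
    """Build a row from separate debit/credit cells; exactly one must be positive."""
    debit = Decimal(debit) if debit else Decimal("0")
    credit = Decimal(credit) if credit else Decimal("0")
    if (debit > 0) == (credit > 0):
        # Both filled or both empty usually means the columns were misread.
        raise ValueError("expected exactly one of debit/credit to be positive")
    if debit > 0:
        return TransactionRow(booked_on, "debit", debit, inn, purpose, confidence)
    return TransactionRow(booked_on, "credit", credit, inn, purpose, confidence)
```

Rejecting rows where both or neither cell is filled turns debit/credit column ambiguity into an explicit error instead of a silent sign flip.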
Validate extracted rows by reconciling debit and credit totals against summary turnover when present. Then compute features such as suspicious outgoing amount, suspicious share of total outgoing turnover, number of suspicious counterparties and concentration by counterparty. Low-confidence or non-reconciling statements should go to review or more expensive parsing.
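The reconciliation and aggregation steps can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: rows are plain dicts with hypothetical keys `direction`, `amount`, `inn` and `confidence`, and the tolerance and confidence threshold are placeholder values.

```python
from collections import defaultdict
from decimal import Decimal

ZERO = Decimal("0")


def reconcile(rows, stated_debit, stated_credit, tolerance=Decimal("0.01")):
    """Return True if extracted totals match the statement's summary turnover."""
    debit = sum((r["amount"] for r in rows if r["direction"] == "debit"), ZERO)
    credit = sum((r["amount"] for r in rows if r["direction"] == "credit"), ZERO)
    return (abs(debit - stated_debit) <= tolerance
            and abs(credit - stated_credit) <= tolerance)


def suspicious_features(rows, blacklist):
    """Aggregate outgoing turnover paid to blacklisted counterparties."""
    outgoing = [r for r in rows if r["direction"] == "debit"]
    total_out = sum((r["amount"] for r in outgoing), ZERO)
    by_counterparty = defaultdict(Decimal)  # Decimal() starts at 0
    for r in outgoing:
        if r["inn"] in blacklist:
            by_counterparty[r["inn"]] += r["amount"]
    suspicious_out = sum(by_counterparty.values(), ZERO)
    return {
        "suspicious_out": suspicious_out,
        # None when there is no outgoing turnover, so the share is undefined.
        "suspicious_share": suspicious_out / total_out if total_out else None,
        "n_suspicious_counterparties": len(by_counterparty),
        # Concentration: largest counterparty's part of the suspicious amount.
        "top_counterparty_share": (
            max(by_counterparty.values()) / suspicious_out if suspicious_out else None
        ),
    }


def needs_review(rows, reconciled, min_confidence=0.9):
    """Route to review if totals do not reconcile or any row is low-confidence."""
    return (not reconciled) or any(r["confidence"] < min_confidence for r in rows)
```

Using `Decimal` rather than floats keeps the comparison against statement totals exact to the cent.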
Theory
Risk aggregation depends on correctly extracting both the numerator (amounts paid to suspicious counterparties) and the denominator (total outgoing turnover, the measure of business activity).
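Written out, the ratio looks like this. The symbols are introduced here for illustration only: $D$ is the set of outgoing transactions in the statement, $a_t$ the amount of transaction $t$, $c(t)$ its counterparty and $B$ the blacklist.

```latex
\[
\text{suspicious share}
  = \frac{\sum_{t \in D,\; c(t) \in B} a_t}{\sum_{t \in D} a_t}
\]
% Hypothetical numbers, not taken from any real statement:
% incidental payment: 10{,}000 / 5{,}000{,}000 = 0.002 (0.2\%)
% routed turnover:    3{,}000{,}000 / 5{,}000{,}000 = 0.6 (60\%)
```

Both hypothetical firms have a blacklist hit, but only the ratio separates them. An error in either sum, such as a missed row or a flipped debit/credit column, shifts the share directly.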
Common mistakes
- Treat any blacklist hit as equally severe.
- Extract INNs (taxpayer identification numbers) but not transaction direction or amount.
- Ignore debit and credit column ambiguity.
- Skip reconciliation against statement totals.
How to answer in an interview
- Say “share of outgoing turnover” rather than just count of hits.
- Use total reconciliation as the main quality check.