ML System Design
A bank asks a suspicious legal entity for PDF statements from other banks. Design how ML can extract compliance value from those statements.
First talk through your answer aloud or jot it down as bullet points.
Formulas, a solution plan, risks and examples.
Open the walkthrough only after you have made your own attempt.
Short answer
Frame the task as document-to-risk evidence: parse statements, extract counterparties and payments, compare against blacklists and observed activity, then produce interpretable signals for compliance decisions.
Detailed walkthrough
Start by clarifying the decision. The bank is not merely classifying a PDF; it needs evidence for whether a legal entity should remain serviced, whether more documents are needed, or whether the case should escalate to compliance.
Useful outputs include counterparties, INNs, payment purposes, transaction amounts, dates, turnover shares, suspicious counterparties, activity categories and differences between activity in the external bank and activity observed internally. The system should produce structured evidence and confidence, not only a binary verdict.
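As a minimal sketch of what "structured evidence and confidence" could look like, the record below keeps each extracted fact next to the page it came from and its extraction confidence. The class names, fields, the 0.9 threshold and the identifiers are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Evidence:
    """One extracted fact, traceable to its place in the statement."""
    kind: str          # e.g. "blacklist_hit" or "turnover_share"
    value: str         # normalized value, kept as text for auditability
    page: int          # page where the supporting text was found
    confidence: float  # extraction confidence in [0, 1]


@dataclass
class CaseReport:
    """Structured output for a compliance manager (hypothetical schema)."""
    entity_inn: str
    suspicious_outgoing_share: Decimal
    evidence: list[Evidence] = field(default_factory=list)

    def needs_review(self, min_confidence: float = 0.9) -> bool:
        # Any weakly supported fact sends the case to a human.
        return any(e.confidence < min_confidence for e in self.evidence)


# Placeholder identifiers, used only to show the shape of the record.
report = CaseReport(
    entity_inn="7707083893",
    suspicious_outgoing_share=Decimal("0.37"),
    evidence=[Evidence("blacklist_hit", "500100732259", page=4, confidence=0.82)],
)
print(report.needs_review())
```

Keeping the evidence list next to the aggregate lets a reviewer see why a case was flagged instead of receiving only a verdict.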
Constrain the first version. Assume text PDFs rather than scanned images, legal entities rather than individuals, and authentic (unforged) documents, if the interviewer agrees to that scope. Then design a pipeline with extraction, validation, risk aggregation, human review and monitoring, because compliance needs traceability and a low false-negative rate.
Common mistakes
- Jump straight to an LLM classifier over the whole PDF.
- Ignore legal-entity specificity and mandatory fields.
- Return a black-box risk score without evidence.
- Forget that the regulator and the bank can both make errors.
How to say it in the interview
- Clarify whether the PDF is text or scanned.
- Say that the output must be interpretable for compliance managers.
ML System Design
How would you parse readable PDF bank statements from many banks into structured transactions without sending personal data to an external API?
Short answer
Use deterministic parsers for common known formats, lightweight ML/NER for noisy fields, and reserve a local LLM or human review for low-confidence or unknown formats.
Detailed walkthrough
A practical architecture should be hybrid. For high-volume known banks and known statement templates, maintain deterministic parsers or table extractors because they are cheap, auditable and fast. For unknown or changed formats, use a generic extraction layer that converts PDF text/layout into candidate rows and fields.
Then apply validation. INN lengths and checksums, dates, debit/credit consistency, required fields and total-turnover checks can catch many extraction errors. A small local model can classify tokens or spans into INN, amount, payment purpose and date. A local LLM can be used only for low-confidence fragments or templates that rules do not cover.
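The INN checksum mentioned above is a cheap deterministic check. The sketch below uses the standard weighted-sum rule for 10-digit and 12-digit INNs; the function names are illustrative.

```python
# Weights of the standard INN control-digit rule.
_W10 = (2, 4, 10, 3, 5, 9, 4, 6, 8)            # 10-digit INN, last digit
_W11 = (7, 2, 4, 10, 3, 5, 9, 4, 6, 8)         # 12-digit INN, 11th digit
_W12 = (3, 7, 2, 4, 10, 3, 5, 9, 4, 6, 8)      # 12-digit INN, 12th digit


def _check_digit(digits: str, weights: tuple[int, ...]) -> int:
    # zip stops at the shorter sequence, so only the leading digits are used.
    total = sum(w * int(d) for w, d in zip(weights, digits))
    return total % 11 % 10


def is_valid_inn(value: str) -> bool:
    """True when the string has a valid length and control digit(s)."""
    if not (value.isascii() and value.isdigit()):
        return False
    if len(value) == 10:
        return _check_digit(value, _W10) == int(value[9])
    if len(value) == 12:
        return (_check_digit(value, _W11) == int(value[10])
                and _check_digit(value, _W12) == int(value[11]))
    return False
```

A passing checksum only shows that a digit string is a plausible INN. It does not prove that the string is the counterparty field of a transaction, so context checks are still needed.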
Because the data is sensitive, external APIs are out of scope unless there is a compliant anonymization process, which itself is risky. Keep raw documents in the bank perimeter, log structured evidence, and design fallbacks to human review for cases where automation cannot reach the required confidence.
Common mistakes
- Send full statements to an external LLM.
- Use only regexes and ignore changing templates.
- Use only LLMs and ignore cost, latency and auditability.
- Skip deterministic validation of extracted fields.
How to say it in the interview
- Say which formats get rules and which get ML fallback.
- Tie the architecture to privacy and audit constraints.
Production ML question
You have about 10,000 statement pages per night, 100 banks, one CPU server and sensitive data that cannot leave the bank. How do you allocate expensive local LLM usage?
Short answer
Make the default path cheap and deterministic, estimate throughput, then spend the local LLM budget only on candidate fragments, unknown formats and validation failures.
Detailed walkthrough
First do the arithmetic. A 7B local model on CPU can take minutes per page, so it cannot process all 10,000 pages overnight. The pipeline needs a cheap first pass: PDF text extraction, layout-aware heuristics, regex candidates, known-template parsers and lightweight ML.
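The arithmetic can be made explicit with a small budget calculation. The 8-hour batch window and the 120 seconds per page are assumed values for illustration only; measured latency on the real server should replace them.

```python
def llm_page_budget(window_hours: float, seconds_per_page: float, workers: int = 1) -> int:
    """How many pages a local LLM can process inside the batch window."""
    return int(window_hours * 3600 * workers // seconds_per_page)


pages_per_night = 10_000  # from the task statement
# Assumed: an 8-hour window and 2 minutes of LLM time per page.
budget = llm_page_budget(window_hours=8, seconds_per_page=120)
share = budget / pages_per_night
print(budget, f"{share:.1%}")  # 240 2.4%
```

Under these assumptions the LLM can see only a few percent of the pages, which is why routing matters more than the choice of model.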
Use routing. Pages with known templates and passing validations stay on the cheap path. Unknown templates, failed total checks, suspicious blacklist candidates, or ambiguous numeric runs are routed to the local LLM or human review. The LLM should receive small fragments, not whole thousand-page statements.
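The routing rule can be sketched as a small pure function. The signal names and the three routes are hypothetical; they mirror the conditions listed above rather than any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CHEAP = "cheap"
    LLM = "llm"
    REVIEW = "review"


@dataclass
class PageSignals:
    known_template: bool
    totals_reconcile: bool
    blacklist_candidate: bool
    ambiguous_digit_run: bool


def route_page(s: PageSignals, llm_budget_left: int) -> Route:
    """Keep easy pages on the cheap path and escalate only hard ones."""
    hard = (not s.known_template or not s.totals_reconcile
            or s.blacklist_candidate or s.ambiguous_digit_run)
    if not hard:
        return Route.CHEAP
    # Hard pages go to the LLM while budget remains, otherwise to humans.
    return Route.LLM if llm_budget_left > 0 else Route.REVIEW
```

Falling back to human review when the budget is exhausted is one possible degradation policy. A stricter policy could send blacklist candidates to review regardless of the remaining budget.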
Track throughput and backlog as product constraints. If the daily batch must finish before the next banking day, define per-page budgets and graceful degradation. For low-risk statements, return a conservative no-hit result only when cheap checks are strong enough; for high-risk or ambiguous cases, produce a review queue rather than forcing a low-confidence model answer.
Common mistakes
- Propose a local LLM over every page without throughput math.
- Ignore overnight batch deadlines.
- Send too much context to the LLM.
- Return confident verdicts for low-confidence extraction failures.
How to say it in the interview
- Estimate the page budget out loud.
- Use “route only hard fragments to LLM” as the core design.
Production ML question
You have a large blacklist of bad INNs and noisy PDF text where digits can be glued together. How would you find likely blacklist hits efficiently and accurately?
Short answer
Use blacklist-first candidate generation with exact string or rolling-hash lookup over 10/12-digit windows, then validate context and handle glued or line-broken digits with normalization and review routing.
Detailed walkthrough
Instead of parsing every numeric token first, invert the problem. Store the blacklist INNs in a hash set or trie. Scan extracted text for 10- and 12-digit windows (legal-entity INNs have 10 digits, INNs of individuals and sole proprietors have 12), or use rolling hashes for long digit runs, and check whether any substring matches a known bad INN. This is cheap and can rule out most pages before expensive validation.
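A minimal sketch of blacklist-first candidate generation follows. It uses plain string slicing with a hash-set lookup instead of a rolling hash, which is simpler and usually adequate for digit runs of moderate length; the function name and return shape are illustrative.

```python
import re

# Runs of at least 10 ASCII digits, the shortest possible INN length.
DIGIT_RUN = re.compile(r"[0-9]{10,}")


def blacklist_candidates(text: str, blacklist: set[str]) -> list[tuple[int, str]]:
    """Return (offset, inn) for every 10- or 12-digit window found in the blacklist."""
    hits = []
    for run in DIGIT_RUN.finditer(text):
        digits = run.group()
        for size in (10, 12):
            # Slide a window of the given size over the whole digit run.
            for i in range(len(digits) - size + 1):
                window = digits[i:i + size]
                if window in blacklist:
                    hits.append((run.start() + i, window))
    return hits


text = "INN 7707083893 acc 4070281077070838930001"
print(blacklist_candidates(text, {"7707083893"}))
```

In this example the second hit sits inside a longer account-like digit run. That is exactly the kind of candidate that must pass context validation before it counts as a counterparty match.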
Then validate candidates. A substring hit inside a bank account number or glued BIK-KPP-account sequence is not necessarily a transaction counterparty INN. Use nearby context, known field labels, layout coordinates if available, checksum rules, transaction row segmentation, and local ML/LLM only for the small number of candidate hits.
The hard cases are line breaks, OCR/text extraction order and glued digits. Normalize whitespace, preserve page coordinates when possible, test across bank templates, and send ambiguous high-risk candidates to human review rather than suppressing them silently.
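One way to keep recall on line-broken digits is to search a second, normalized view of the text in which whitespace between digits is removed. This is a sketch under the assumption that the normalized view is used only for candidate generation, since it also glues unrelated numbers together.

```python
import re

# Whitespace (including line breaks) sitting between two ASCII digits.
_GAP = re.compile(r"(?<=[0-9])\s+(?=[0-9])")


def join_broken_digits(text: str) -> str:
    """High-recall view of the text: digits split by spaces or line breaks are glued."""
    return _GAP.sub("", text)


print(join_broken_digits("INN 77070\n83893 paid"))  # INN 7707083893 paid
```

Candidates found only in the normalized view are less certain than those found in the original text, so they are natural inputs for the review queue.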
Common mistakes
- Run an LLM over every page to ask whether an INN exists.
- Assume any 10- or 12-digit number is an INN.
- Ignore glued account-number strings.
- Drop line-broken candidates without measuring recall impact.
How to say it in the interview
- Mention a hash set or trie for blacklist lookup.
- Separate high-recall candidate search from high-precision validation.
ML System Design
Finding one bad counterparty is not enough. How would you compute the share of turnover that went to suspicious counterparties across heterogeneous bank statements?
Short answer
Extract transaction rows with amounts, direction and counterparties, validate against statement totals, then aggregate suspicious outgoing turnover share with confidence and review thresholds.
Detailed walkthrough
The business signal is not only “was there any blacklist hit”. A legitimate company may have a tiny incidental payment to a bad counterparty, while a suspicious firm may route a large share of turnover through bad counterparties. Therefore the system needs transaction-level amounts and direction.
Parse each statement into rows containing date, debit or credit amount, counterparty identifiers, payment purpose and confidence. Bank formats differ: debit and credit columns may move, summary totals may appear at the beginning or end, and PDF extraction can reorder columns. Known templates, layout features and token classification are safer than one global regex.
Validate extracted rows by reconciling debit and credit totals against summary turnover when present. Then compute features such as suspicious outgoing amount, suspicious share of total outgoing turnover, number of suspicious counterparties and concentration by counterparty. Low-confidence or non-reconciling statements should go to review or more expensive parsing.
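The aggregation and the reconciliation check can be sketched as follows. The `Row` schema, the convention that "debit" means outgoing money, and the 0.01 tolerance are assumptions for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Row:
    counterparty_inn: str
    amount: Decimal
    direction: str  # assumed convention: "debit" = outgoing, "credit" = incoming


def _total(rows, direction: str) -> Decimal:
    return sum((r.amount for r in rows if r.direction == direction), Decimal("0"))


def suspicious_outgoing_share(rows: list[Row], blacklist: set[str]) -> Decimal:
    """Share of outgoing turnover paid to blacklisted counterparties."""
    total = _total(rows, "debit")
    if total == 0:
        return Decimal("0")
    bad = sum((r.amount for r in rows
               if r.direction == "debit" and r.counterparty_inn in blacklist),
              Decimal("0"))
    return bad / total


def reconciles(rows: list[Row], stated_debit: Decimal, stated_credit: Decimal,
               tolerance: Decimal = Decimal("0.01")) -> bool:
    """Compare extracted totals with the totals printed in the statement summary."""
    return (abs(_total(rows, "debit") - stated_debit) <= tolerance
            and abs(_total(rows, "credit") - stated_credit) <= tolerance)
```

`Decimal` is used instead of floats so that sums of monetary amounts stay exact. A share computed from rows that do not reconcile should be treated as low-confidence.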
Common mistakes
- Treat any blacklist hit as equally severe.
- Extract INNs but not transaction direction or amount.
- Ignore debit and credit column ambiguity.
- Skip reconciliation against statement totals.
How to say it in the interview
- Say “share of outgoing turnover” rather than just count of hits.
- Use total reconciliation as the main quality check.
ML System Design
How would you build and validate a training dataset for extracting transaction fields from many bank-statement formats with limited human labeling?
Short answer
Sample diverse banks and formats, label fields or bootstrap labels with high-confidence parsers, augment safely with synthetic values, and validate extraction with reconciliation checks and targeted human audit.
Detailed walkthrough
The dataset must be diverse by bank, statement template, time period, page length and transaction type. Do not let one thousand-page statement dominate. Stratified sampling across banks and templates gives better coverage than labeling many pages from one format.
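A stratified sampler with a per-document cap is one way to stop a single long statement from dominating. The page schema (`bank`, `template`, `doc_id` keys) and the parameter names below are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(pages, per_stratum: int, per_document: int, seed: int = 0):
    """Sample pages per (bank, template) stratum, capping pages taken per document.

    pages: iterable of dicts with 'bank', 'template' and 'doc_id' keys (assumed schema).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for page in pages:
        strata[(page["bank"], page["template"])].append(page)

    sample = []
    for key in sorted(strata):
        candidates = strata[key][:]
        rng.shuffle(candidates)
        taken, per_doc = 0, defaultdict(int)
        for page in candidates:
            if taken == per_stratum:
                break
            if per_doc[page["doc_id"]] < per_document:
                per_doc[page["doc_id"]] += 1
                sample.append(page)
                taken += 1
    return sample
```

With `per_document=3`, a thousand-page statement contributes at most three pages, the same as a five-page one from another bank.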
Labels can come from manual annotation, high-confidence deterministic parsers for known templates, or offline LLM assistance inside the secure perimeter. Synthetic data can help with amounts, INNs, KPPs, names and layout variants, but it should augment real templates rather than replace them. Otherwise the model learns synthetic regularities.
Validation should combine field-level metrics and document-level invariants. Check precision and recall for INN, amount, date and payment purpose spans. Reconcile extracted debit/credit totals with statement summaries. Audit samples where totals do not match, confidence is low, or a format is new. Keep a held-out set by bank or template to measure generalization.
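Field-level metrics and a held-out split by bank can be sketched in a few lines. Exact-match scoring over `(start, end, label)` triples is one common choice and is assumed here; the `bank` key in the document dicts is also an assumption.

```python
def span_prf(gold: set[tuple[int, int, str]], pred: set[tuple[int, int, str]]):
    """Exact-match span precision, recall and F1 over (start, end, label) triples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def split_by_bank(docs, held_out_banks: set[str]):
    """Keep whole banks out of training to measure generalization to unseen formats."""
    train = [d for d in docs if d["bank"] not in held_out_banks]
    test = [d for d in docs if d["bank"] in held_out_banks]
    return train, test


gold = {(0, 10, "INN"), (15, 25, "AMOUNT"), (30, 40, "DATE")}
pred = {(0, 10, "INN"), (15, 24, "AMOUNT")}  # the amount span is off by one character
print(span_prf(gold, pred))
```

Exact matching is deliberately strict: an amount span that misses its last digit counts as an error, which matches how such a mistake would break reconciliation downstream.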
Common mistakes
- Label many pages from one bank and call the dataset large.
- Trust synthetic data without real template coverage.
- Evaluate only token-level F1 and ignore document totals.
- Let an LLM create labels with no audit or reconciliation.
How to say it in the interview
- Use stratification by bank/template/page length.
- Name reconciliation as an automatic trust check.
Question
Which lightweight model would you use to extract fields such as INN, amount, date and payment purpose from noisy statement text, and what should it output?
Short answer
A BERT-style token classifier or layout-aware variant can label spans as amount, INN, date, counterparty and purpose; post-processing turns spans into normalized fields and checks consistency.
Detailed walkthrough
For a lightweight trainable model, formulate extraction as token classification or span labeling. The model reads a text fragment or transaction candidate and predicts labels such as B-INN, I-INN, B-AMOUNT, B-DATE, B-PURPOSE and O. If layout coordinates are available, a layout-aware model can help, but a small BERT-style encoder plus post-processing is a reasonable baseline.
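To show what the classifier output looks like downstream, here is a minimal sketch that merges BIO tags into labeled spans. It assumes one tag per token and treats a stray `I-` tag as the start of a new span; the token and label values are made up.

```python
def decode_bio(tokens: list[str], labels: list[str]) -> list[tuple[str, str]]:
    """Merge BIO-tagged tokens into (label, text) spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, labels):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            words.append(token)  # continue the open span
            continue
        if label is not None:
            spans.append((label, " ".join(words)))  # close the previous span
        # "B-X" opens a span, a stray "I-X" is treated as an opening, "O" opens nothing.
        label, words = (name, [token]) if prefix in ("B", "I") else (None, [])
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


tokens = ["INN", "7707083893", "sum", "1", "234,50", "RUB"]
labels = ["O", "B-INN", "O", "B-AMOUNT", "I-AMOUNT", "O"]
print(decode_bio(tokens, labels))
# [('INN', '7707083893'), ('AMOUNT', '1 234,50')]
```

The spans keep the original text, so a reviewer can see exactly which characters the model pointed at before any normalization.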
The model should not regress the amount directly as a number. It should identify the span that contains the amount; deterministic post-processing then normalizes separators, currency, sign and debit/credit direction. This is easier to audit and debug.
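A sketch of the deterministic amount normalization follows. It assumes that a final `.` or `,` followed by one or two digits is the decimal separator and that every other separator is a thousands separator; real statement formats may need per-template rules.

```python
import re
from decimal import Decimal
from typing import Optional


def normalize_amount(span: str) -> Optional[Decimal]:
    """Turn an extracted amount span into a Decimal, or None if it does not parse."""
    # Keep only digits, separators and a minus sign; this drops spaces and currency text.
    cleaned = re.sub(r"[^0-9.,-]", "", span)
    sign = "-" if cleaned.startswith("-") else ""
    cleaned = cleaned.replace("-", "").strip(".,")
    # Assumption: a trailing separator plus 1-2 digits is the fractional part.
    match = re.search(r"[.,]([0-9]{1,2})$", cleaned)
    if match:
        integer, fraction = cleaned[:match.start()], match.group(1)
    else:
        integer, fraction = cleaned, "0"
    integer = re.sub(r"[.,]", "", integer)  # remaining separators group thousands
    if not integer:
        return None
    return Decimal(f"{sign}{integer}.{fraction}")


print(normalize_amount("1 234 567,89 RUB"))  # 1234567.89
```

Debit or credit direction is better taken from the column or row context than from a sign inside the span, so this function only normalizes the number itself.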
Training data should contain the same text-extraction noise as production. Evaluate both token-level/span-level metrics and downstream parsing quality: exact amount match, correct counterparty INN, row grouping quality and reconciliation with statement totals.
Common mistakes
- Ask a model to output a free-form JSON with no span evidence.
- Regress money amounts directly from embeddings.
- Train on clean text while production text is column-mixed.
- Ignore post-processing and validation.
How to say it in the interview
- Use BIO token labels in the answer.
- Explain why post-processing handles numeric normalization.
Deployment, artifacts and format-drift monitoring for document ML
Short answer
Store raw-to-structured artifacts, extraction confidence, reconciliation results, suspicious hits and model/template versions; alert on parse failures, total mismatches, distribution shifts and bank-specific error spikes.
Detailed walkthrough
Persist enough evidence to debug every decision: document id, bank/template guess, parser version, model version, extracted rows, span confidences, normalized fields, suspicious INNs, turnover aggregates, reconciliation totals and final verdict. Raw sensitive files stay in secure storage with access control.
Operational metrics include processing latency, batch completion time, parse success rate, share of pages routed to LLM or human review, reconciliation mismatch rate, unknown-template rate, suspicious-hit rate and manual correction rate. Break these down by bank and template version.
Format drift often appears as a sudden bank-specific spike in parse failures, total mismatches, unknown templates, missing mandatory fields or support complaints. Add alerts and sampled review. In Airflow or a similar orchestrator, store model and parser artifacts, data snapshots, configuration, run ids and quality reports so a bad release can be rolled back and reproduced.
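A per-bank alert on the reconciliation mismatch rate can be sketched as below. The thresholds (`min_docs`, `factor`, `floor`) and the input shape are assumptions; the point is that the rate is computed per bank and compared with that bank's own baseline.

```python
from collections import Counter


def drift_alerts(runs, baseline: dict[str, float], min_docs: int = 20,
                 factor: float = 3.0, floor: float = 0.05) -> dict[str, float]:
    """Flag banks whose mismatch rate is far above their usual level.

    runs: iterable of (bank, reconciled) pairs for one batch.
    baseline: bank -> usual mismatch rate from earlier batches.
    """
    total, mismatched = Counter(), Counter()
    for bank, reconciled in runs:
        total[bank] += 1
        if not reconciled:
            mismatched[bank] += 1

    alerts = {}
    for bank, n in total.items():
        if n < min_docs:
            continue  # too few documents to judge
        rate = mismatched[bank] / n
        if rate > max(factor * baseline.get(bank, 0.0), floor):
            alerts[bank] = rate
    return alerts


runs = ([("A", True)] * 95 + [("A", False)] * 5
        + [("B", True)] * 60 + [("B", False)] * 40)
print(drift_alerts(runs, {"A": 0.04, "B": 0.05}))  # {'B': 0.4}
```

In this example the overall mismatch rate across both banks is 22.5%, which hides the fact that bank A is healthy and bank B has most likely changed its template.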
Common mistakes
- Monitor only API latency and error rate.
- Fail to version parsers and templates.
- Aggregate metrics across banks and miss one-bank format drift.
- Store only final verdicts with no extracted evidence.
How to say it in the interview
- Name reconciliation mismatch rate as a key alert.
- Mention parser/template/model versions as artifacts.