ML System Design

How would you build and validate a training dataset for extracting transaction fields from many bank-statement formats with limited human labeling?

Short answer

Sample across diverse banks and formats, label fields manually or bootstrap labels with high-confidence parsers, augment safely with synthetic values, and validate extraction with reconciliation checks and a targeted human audit.

Full breakdown

The dataset must be diverse by bank, statement template, time period, page length and transaction type. Do not let a single thousand-page statement dominate the sample. Stratified sampling across banks and templates gives better coverage than labeling many pages from one format.
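The sampling idea above can be sketched as follows. This is a minimal illustration, not a prescribed pipeline: the page record keys (`bank`, `template`, `doc_id`, `page`) and the parameters `per_stratum` and `max_pages_per_doc` are assumptions introduced for the example.

```python
import random
from collections import defaultdict


def stratified_page_sample(pages, per_stratum, max_pages_per_doc, seed=0):
    """Pick pages to label, balanced across (bank, template) strata.

    `pages` is an iterable of dicts with hypothetical keys
    'bank', 'template', 'doc_id' and 'page'.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(lambda: defaultdict(list))
    for p in pages:
        by_stratum[(p["bank"], p["template"])][p["doc_id"]].append(p)

    selected = []
    for stratum in sorted(by_stratum):
        candidates = []
        for doc_id in sorted(by_stratum[stratum]):
            doc_pages = by_stratum[stratum][doc_id]
            # Cap each document so one very long statement cannot dominate.
            k = min(max_pages_per_doc, len(doc_pages))
            candidates.extend(rng.sample(doc_pages, k))
        rng.shuffle(candidates)
        # Take at most `per_stratum` pages from each (bank, template) pair.
        selected.extend(candidates[:per_stratum])
    return selected


# Toy corpus: one 1000-page statement, one 3-page statement, one other bank.
pages = (
    [{"bank": "A", "template": "t1", "doc_id": "big", "page": i} for i in range(1000)]
    + [{"bank": "A", "template": "t1", "doc_id": "small", "page": i} for i in range(3)]
    + [{"bank": "B", "template": "t2", "doc_id": "b1", "page": i} for i in range(5)]
)
sample = stratified_page_sample(pages, per_stratum=4, max_pages_per_doc=2)
```

With these settings the 1000-page document contributes no more pages than the 3-page one, so the labeling budget is spread across documents and templates rather than across raw page counts.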

Labels can come from manual annotation, high-confidence deterministic parsers for known templates, or offline LLM assistance inside the secure perimeter. Synthetic data can help with amounts, INNs and KPPs (Russian taxpayer identification numbers and tax registration reason codes), names and layout variants, but it should augment real templates rather than replace them. Otherwise the model learns regularities of the generator instead of those of real statements.
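One way to augment a real template with synthetic values is to keep the labeled row and swap only the field value, updating the span. The sketch below does this for a 10-digit INN using the commonly documented check-digit weights; the helper names and the sample row are hypothetical, and only the checksum is modelled (region and tax-office prefixes are not).

```python
import random

# Check-digit weights for a 10-digit (legal-entity) INN.
INN10_WEIGHTS = (2, 4, 10, 3, 5, 9, 4, 6, 8)


def inn10_check_digit(first9):
    """Compute the 10th digit from the first nine digits."""
    total = sum(w * int(d) for w, d in zip(INN10_WEIGHTS, first9))
    return str(total % 11 % 10)


def synthetic_inn10(rng):
    """Random 10-digit INN with a valid checksum (prefix is not realistic)."""
    first9 = "".join(str(rng.randrange(10)) for _ in range(9))
    return first9 + inn10_check_digit(first9)


def swap_span(text, start, end, new_value):
    """Replace text[start:end]; return the new text and the new span."""
    new_text = text[:start] + new_value + text[end:]
    return new_text, (start, start + len(new_value))


rng = random.Random(42)
row = "Payment to LLC Example INN 7707083893 invoice 15"  # illustrative row
start = row.index("7707083893")
new_inn = synthetic_inn10(rng)
new_row, new_span = swap_span(row, start, start + 10, new_inn)
```

Because the surrounding text and layout come from a real row, the model still sees real template structure while the field values vary.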

Validation should combine field-level metrics and document-level invariants. Check precision and recall for INN, amount, date and payment purpose spans. Reconcile extracted debit/credit totals with statement summaries. Audit samples where totals do not match, confidence is low, or the format has not been seen before. Hold out entire banks or templates to measure generalization to unseen formats.
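The reconciliation and audit rules above could look roughly like this. It is a sketch under stated assumptions: the document structure (`transactions`, `summary`, `template`, per-transaction `confidence`) and the 0.9 confidence threshold are hypothetical and would need to match the real extraction output.

```python
from decimal import Decimal


def reconcile(transactions, summary, tolerance=Decimal("0.00")):
    """Compare extracted totals with the totals printed on the statement.

    `transactions`: dicts with hypothetical keys 'direction'
    ('debit' or 'credit') and 'amount' (Decimal).
    `summary`: dict with hypothetical keys 'debit_total' and 'credit_total'.
    """
    totals = {"debit": Decimal("0"), "credit": Decimal("0")}
    for t in transactions:
        totals[t["direction"]] += t["amount"]
    return {
        side: abs(totals[side] - summary[f"{side}_total"]) <= tolerance
        for side in totals
    }


def audit_reasons(doc, known_templates, min_confidence=0.9):
    """Return the reasons (possibly none) to route a document to human audit."""
    reasons = []
    if not all(reconcile(doc["transactions"], doc["summary"]).values()):
        reasons.append("totals_mismatch")
    if any(t["confidence"] < min_confidence for t in doc["transactions"]):
        reasons.append("low_confidence")
    if doc["template"] not in known_templates:
        reasons.append("new_template")
    return reasons


doc = {
    "template": "bank_x_v2",
    "transactions": [
        {"direction": "debit", "amount": Decimal("100.50"), "confidence": 0.99},
        {"direction": "debit", "amount": Decimal("20.00"), "confidence": 0.95},
        {"direction": "credit", "amount": Decimal("300.00"), "confidence": 0.97},
    ],
    # The printed credit total differs from the extracted sum of 300.00.
    "summary": {"debit_total": Decimal("120.50"), "credit_total": Decimal("250.00")},
}
checks = reconcile(doc["transactions"], doc["summary"])
reasons = audit_reasons(doc, known_templates={"bank_a_v1"})
```

Using `Decimal` rather than floats keeps the comparison exact for monetary amounts, so a mismatch signals an extraction problem rather than rounding noise.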

Theory

Parser datasets need coverage of layout variation and validation by accounting invariants, not just random row labels.

Common mistakes

  • Labeling many pages from one bank and calling the dataset large.
  • Trusting synthetic data without real template coverage.
  • Evaluating only token-level F1 and ignoring document totals.
  • Letting an LLM create labels with no audit or reconciliation.

How to answer in an interview

  • Use stratification by bank/template/page length.
  • Name reconciliation as an automatic trust check.