ML System Design

How would you parse readable PDF bank statements from many banks into structured transactions without sending personal data to an external API?

Short answer

Use deterministic parsers for common, known formats and lightweight ML/NER for noisy fields, and reserve a local LLM or human review for low-confidence extractions and unknown formats.

Full breakdown

A practical architecture should be hybrid. For high-volume known banks and known statement templates, maintain deterministic parsers or table extractors because they are cheap, auditable and fast. For unknown or changed formats, use a generic extraction layer that converts PDF text/layout into candidate rows and fields.
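
The routing idea above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the template name `examplebank_v1`, its marker strings, the semicolon-separated line layout and the helper names are all hypothetical, and the input is assumed to be text already extracted from the PDF inside the perimeter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Template:
    name: str
    markers: tuple  # strings that must all appear in the statement text
    parse: Callable[[str], list]


def parse_examplebank_v1(text: str) -> list:
    # Hypothetical layout: "DD.MM.YYYY;amount;purpose" on each transaction line.
    rows = []
    for line in text.splitlines():
        parts = line.split(";")
        if len(parts) == 3 and parts[0][:2].isdigit():
            rows.append({"date": parts[0], "amount": parts[1], "purpose": parts[2]})
    return rows


def generic_extract(text: str) -> list:
    # Stand-in for the generic layer: every non-empty line becomes a
    # candidate row that later stages (ML, validation) must interpret.
    return [{"raw": line} for line in text.splitlines() if line.strip()]


# Registry of known templates (hypothetical single entry).
TEMPLATES = [
    Template("examplebank_v1", ("EXAMPLE BANK", "Statement v1"), parse_examplebank_v1),
]


def route(text: str) -> tuple:
    """Return (parser_name, rows): a deterministic parser if a known
    template matches, otherwise the generic fallback."""
    for template in TEMPLATES:
        if all(marker in text for marker in template.markers):
            return template.name, template.parse(text)
    return "generic", generic_extract(text)
```

Returning the parser name alongside the rows keeps the result auditable: downstream code can tell whether a row came from a trusted template or from the generic path and treat its confidence accordingly.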

Then validate the extracted fields deterministically. Checks on INN (Russian taxpayer ID) length and checksum, date parsing, debit/credit consistency, required fields and total turnover catch many extraction errors. A small local model can classify tokens or spans as INN, amount, payment purpose or date. A local LLM should be used only for low-confidence fragments or for templates that the rules do not cover.
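
Two of these checks are easy to show in code. The sketch below validates an INN with the standard weighted check digits (10-digit and 12-digit variants) and reconciles turnover against the balances. The function names are illustrative, and the turnover check assumes signed amounts (credits positive, debits negative) and that the statement exposes opening and closing balances.

```python
from decimal import Decimal

# Check-digit weights for the INN.
_W10 = (2, 4, 10, 3, 5, 9, 4, 6, 8)          # 10-digit INN, digit 10
_W11 = (7, 2, 4, 10, 3, 5, 9, 4, 6, 8)       # 12-digit INN, digit 11
_W12 = (3, 7, 2, 4, 10, 3, 5, 9, 4, 6, 8)    # 12-digit INN, digit 12


def _check_digit(digits, weights):
    # zip stops at the shorter sequence, so only the leading digits are used.
    return sum(d * w for d, w in zip(digits, weights)) % 11 % 10


def is_valid_inn(inn: str) -> bool:
    """True if the string is a 10- or 12-digit INN with correct check digits."""
    if not (inn.isascii() and inn.isdigit()) or len(inn) not in (10, 12):
        return False
    d = [int(c) for c in inn]
    if len(d) == 10:
        return _check_digit(d, _W10) == d[9]
    return _check_digit(d, _W11) == d[10] and _check_digit(d, _W12) == d[11]


def turnover_matches(opening: Decimal, closing: Decimal, amounts) -> bool:
    """True if opening balance plus all signed amounts equals the closing balance.

    Decimal avoids float rounding errors on monetary values.
    """
    return opening + sum(amounts, Decimal("0")) == closing
```

A failed checksum or turnover mismatch does not say which value is wrong, only that the row or document should be re-extracted or sent for review.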

Because the data is sensitive, external APIs are out of scope unless there is a compliant anonymization process, which is itself risky. Keep raw documents inside the bank perimeter, log structured evidence for each extracted field (which parser or model produced it and which checks it passed), and fall back to human review where automation cannot reach the required confidence.
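
One possible shape for the confidence-based fallback and the evidence log is sketched below. The thresholds, field names and the `Extraction` record are assumptions for illustration; real thresholds would be tuned on labelled statements.

```python
import hashlib
import json
from dataclasses import dataclass

# Assumed thresholds, not recommendations.
AUTO_ACCEPT = 0.95
LLM_RETRY = 0.70


@dataclass
class Extraction:
    doc_id: str
    field: str
    value: str
    confidence: float
    validators_passed: bool


def decide(e: Extraction) -> str:
    """Route one extracted field: accept, retry with a local LLM, or human review."""
    if e.validators_passed and e.confidence >= AUTO_ACCEPT:
        return "accept"
    if e.confidence >= LLM_RETRY:
        return "local_llm"
    return "human_review"


def evidence_record(e: Extraction, source: str) -> str:
    """Build a JSON log line that omits the raw value.

    The SHA-256 digest only helps detect later changes to the value. It is
    not anonymization: short numeric values can be brute-forced, so the log
    still belongs inside the perimeter.
    """
    return json.dumps(
        {
            "doc_id": e.doc_id,
            "field": e.field,
            "value_sha256": hashlib.sha256(e.value.encode("utf-8")).hexdigest(),
            "confidence": e.confidence,
            "validators_passed": e.validators_passed,
            "source": source,  # e.g. parser or model name (hypothetical label)
            "decision": decide(e),
        },
        sort_keys=True,
    )
```

Note that a field with high model confidence but failed validators is never auto-accepted here; it drops to the local LLM path, which reflects the idea that deterministic checks outrank model scores.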

Theory

Document AI at bank scale is usually a routing and validation problem as much as a modeling problem.

Common mistakes

  • Sending full statements to an external LLM.
  • Relying only on regexes and ignoring template changes.
  • Relying only on LLMs and ignoring cost, latency and auditability.
  • Skipping deterministic validation of extracted fields.

How to answer in an interview

  • Say which formats get rules and which get ML fallback.
  • Tie the architecture to privacy and audit constraints.