Production ML question

You have a large blacklist of bad INNs and noisy PDF text where digits can be glued together. How would you find likely blacklist hits efficiently and accurately?

Short answer

Generate candidates blacklist-first: look up every 10- and 12-digit window of the extracted text in the blacklist by exact string or rolling-hash match. Then validate the context of each hit, normalize glued or line-broken digits, and route ambiguous hits to human review.

Full walkthrough

Instead of parsing every numeric token first, invert the problem. Store the blacklist INNs in a hash set or trie. Scan extracted text for 10- and 12-digit windows, or use rolling hashes for long digit runs, and check whether any substring matches a known bad INN. This is cheap and can eliminate most pages before expensive validation.
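The scan described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `find_blacklist_hits` and the sample text are hypothetical, and the blacklist is assumed to be a plain Python `set` of digit strings.

```python
import re

# Any run of 10 or more digits can contain a 10- or 12-digit INN window.
DIGIT_RUN = re.compile(r"[0-9]{10,}")

def find_blacklist_hits(text, blacklist):
    """Return (start, end, inn) for every 10- or 12-digit window found in the blacklist.

    `blacklist` is assumed to be a set of digit strings (hypothetical input shape).
    """
    hits = []
    for run in DIGIT_RUN.finditer(text):
        digits = run.group()
        for width in (10, 12):
            # Slide a fixed-width window over the run, including glued runs.
            for i in range(len(digits) - width + 1):
                window = digits[i:i + width]
                if window in blacklist:
                    start = run.start() + i
                    hits.append((start, start + width, window))
    return hits

# Usage: the first hit sits inside a glued 20-digit run, the second stands alone.
sample = "acct 40702810770708389300 INN 7707083893"
print(find_blacklist_hits(sample, {"7707083893"}))
```

Slicing and hashing each window costs time proportional to the window width. For very long digit runs a rolling hash would bring that down to constant time per position, at the cost of an exact string comparison to confirm each hash match.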

Then validate candidates. A substring hit inside a bank account number or a glued BIK-KPP-account sequence is not necessarily the INN of a transaction counterparty. Use nearby context, known field labels, layout coordinates if available, checksum rules, and transaction row segmentation, and reserve local ML/LLM models for the small number of candidate hits.
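Two of the cheaper validation signals can be sketched as below, under stated assumptions. `inn_checksum_ok` applies the standard INN check-digit weights. Note that an exact hit on a valid blacklisted INN passes the checksum automatically, so the check is mostly useful for cleaning the blacklist itself. `hit_context` is a hypothetical heuristic: the label pattern, the `lookback` distance, and the returned flags are illustrative choices, not a tested rule set.

```python
import re

# Standard weight vectors for INN check digits.
W10 = (2, 4, 10, 3, 5, 9, 4, 6, 8)          # 10-digit INN, digit 10
W11 = (7, 2, 4, 10, 3, 5, 9, 4, 6, 8)       # 12-digit INN, digit 11
W12 = (3, 7, 2, 4, 10, 3, 5, 9, 4, 6, 8)    # 12-digit INN, digit 12

def _check_digit(digits, weights):
    # zip stops at the shorter sequence, so the full digit list can be passed.
    return sum(d * w for d, w in zip(digits, weights)) % 11 % 10

def inn_checksum_ok(inn):
    """True if `inn` is a 10- or 12-digit string with valid check digits."""
    if not (inn.isascii() and inn.isdigit()):
        return False
    d = [int(c) for c in inn]
    if len(d) == 10:
        return _check_digit(d, W10) == d[9]
    if len(d) == 12:
        return _check_digit(d, W11) == d[10] and _check_digit(d, W12) == d[11]
    return False

# Hypothetical label pattern: a field label directly before the hit.
LABEL = re.compile(r"(?:ИНН|INN)\W{0,3}$", re.IGNORECASE)

def hit_context(text, start, end, lookback=12):
    """Describe a hit at text[start:end]: is it glued to other digits, is it labelled?"""
    glued = (start > 0 and text[start - 1].isdigit()) or (
        end < len(text) and text[end].isdigit()
    )
    labelled = bool(LABEL.search(text[max(0, start - lookback):start]))
    return {"glued": glued, "labelled": labelled}
```

A hit that is labelled and not glued is a strong candidate. A glued, unlabelled hit is more likely a coincidence inside an account number, but in line with the advice above it may still deserve review rather than silent suppression.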

The hard cases are line breaks, OCR or text-extraction ordering, and glued digits. Normalize whitespace, preserve page coordinates when possible, test across bank templates, and send ambiguous high-risk candidates to human review rather than suppressing them silently.
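One way to handle line-broken digits without losing positions is sketched below. It assumes that a short whitespace gap between digits may be an extraction artifact. The names `digit_stream` and `find_hits_normalized`, the `max_gap` threshold, and the `needs_review` flag are all hypothetical. Joining across whitespace raises recall but can also glue unrelated neighbouring numbers together, which is why such hits are flagged rather than trusted.

```python
def digit_stream(text, max_gap=2):
    """Split text into digit runs, letting a run continue across up to
    `max_gap` consecutive whitespace characters (spaces, line breaks).

    Returns a list of (digits, offsets); offsets[i] is the position of
    digits[i] in the original text.
    """
    runs, digits, offsets, gap = [], [], [], 0
    for pos, ch in enumerate(text):
        if ch in "0123456789":
            digits.append(ch)
            offsets.append(pos)
            gap = 0
        elif ch.isspace() and digits and gap < max_gap:
            gap += 1  # tolerate a short whitespace gap inside a run
        else:
            if digits:
                runs.append(("".join(digits), offsets))
            digits, offsets, gap = [], [], 0
    if digits:
        runs.append(("".join(digits), offsets))
    return runs

def find_hits_normalized(text, blacklist):
    """Blacklist lookup over whitespace-tolerant digit runs."""
    hits = []
    for digits, offsets in digit_stream(text):
        for width in (10, 12):
            for i in range(len(digits) - width + 1):
                window = digits[i:i + width]
                if window in blacklist:
                    first, last = offsets[i], offsets[i + width - 1]
                    hits.append({
                        "inn": window,
                        "start": first,
                        "end": last + 1,
                        # The match spans a whitespace gap: route to review.
                        "needs_review": (last - first + 1) != width,
                    })
    return hits

# Usage: the INN is split by a line break in the extracted text.
print(find_hits_normalized("INN 77070\n83893 paid", {"7707083893"}))
```

Keeping the original offsets means a reviewer, or a later layout-aware step, can be pointed at the exact span in the source text.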

Theory

Exact blacklist matching can serve as candidate generation before any semantic parsing, so the expensive precision work is spent on a much smaller set.

Common mistakes

  • Running an LLM over every page to ask whether an INN exists.
  • Assuming any 10- or 12-digit number is an INN.
  • Ignoring glued account-number strings.
  • Dropping line-broken candidates without measuring the impact on recall.

How to answer in an interview

  • Mention a hash set or trie for blacklist lookup.
  • Separate high-recall candidate search from high-precision validation.