Вопрос
Which lightweight model would you use to extract fields such as INN, amount, date and payment purpose from noisy statement text, and what should it output?
Ответить самому
Сначала сформулируйте ответ как на собеседовании, затем откройте разбор и оцените себя.
Короткий ответ
A BERT-style token classifier or layout-aware variant can label spans as amount, INN, date, counterparty and purpose; post-processing turns spans into normalized fields and checks consistency.
Полный разбор
For a lightweight trainable model, formulate extraction as token classification or span labeling. The model reads a text fragment or transaction candidate and predicts labels such as B-INN, I-INN, B-AMOUNT, B-DATE, B-PURPOSE and O. If layout coordinates are available, a layout-aware model can help, but a small BERT-style encoder plus post-processing is a reasonable baseline.
The model should not directly regress arbitrary amounts as a number. It should identify the span that contains the amount, then deterministic post-processing normalizes separators, currency, sign and debit/credit direction. This is easier to audit and debug.
Training data should contain the same text-extraction noise as production. Evaluate both token-level/span-level metrics and downstream parsing quality: exact amount match, correct counterparty INN, row grouping quality and reconciliation with statement totals.
Теория
Field extraction is usually more reliable as span detection plus normalization than as direct generation of final business values.
Типичные ошибки
- Ask a model to output a free-form JSON with no span evidence.
- Regress money amounts directly from embeddings.
- Train on clean text while production text is column-mixed.
- Ignore post-processing and validation.
Как отвечать на собеседовании
- Use BIO token labels in the answer.
- Explain why post-processing handles numeric normalization.