Transformer basics: tokens, positional encoding, and cross-attention

Short answer

A Transformer maps tokens to embeddings, adds positional information, then applies attention and feed-forward blocks. In encoder-decoder cross-attention, decoder states produce queries, while encoder outputs produce keys and values.

Full breakdown

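A solid answer starts with the data flow. Text is split into tokens, token ids are mapped to embeddings, and positional information is injected because self-attention on its own is permutation-equivariant: reordering the input tokens only reorders the outputs, so the layer has no notion of order. Positional information may come from sinusoidal or learned absolute encodings added to the embeddings, from rotary embeddings applied to queries and keys, or from another relative-position scheme.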
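The input side of that data flow can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the vocabulary size, model width, and the random `embedding` table (standing in for a learned one) are assumptions made for the example, and the sinusoidal scheme is just one of the options listed above.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) table of sinusoidal encodings; d_model must be even."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = pos / (10000.0 ** (i / d_model))     # one frequency per dimension pair
    table = np.zeros((seq_len, d_model))
    table[:, 0::2] = np.sin(angles)               # even dimensions
    table[:, 1::2] = np.cos(angles)               # odd dimensions
    return table

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                       # hypothetical sizes
embedding = rng.normal(size=(vocab_size, d_model))  # stands in for a learned table

token_ids = np.array([3, 1, 4, 1])                # token id 1 appears twice
token_vectors = embedding[token_ids]              # (4, d_model) lookup
positions = sinusoidal_positions(len(token_ids), d_model)
x = token_vectors + positions                     # input to the first block
```

Token id 1 occurs at positions 1 and 3. Its two embedding rows are identical, but the rows of `x` differ once the positional table is added, which is the only way the attention layers can tell the two occurrences apart.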

Each Transformer block combines multi-head attention, residual connections, normalization and a position-wise feed-forward network. Encoder blocks use self-attention over the full input. Decoder blocks use causal (masked) self-attention over the target prefix, followed by cross-attention over encoder outputs.
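The difference between full and causal self-attention comes down to a mask on the score matrix. The sketch below shows a single attention head only, with no residual connection, normalization or feed-forward layer; the shapes and the random projection matrices `w_q`, `w_k`, `w_v` are assumptions for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # learned projections of the same states
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len)
    if causal:
        # Position t may only attend to positions <= t.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                           # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v, causal=True)
```

With `causal=True` the weight matrix is lower triangular, so editing a later token cannot change the output at an earlier position. With `causal=False` every position attends to the whole sequence, as in an encoder block.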

For decoder cross-attention, the decoder hidden states are projected to queries. The encoder outputs are projected to keys and values. This is what lets the decoder ask which source positions matter for generating the current target token.
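A single-head sketch makes the Q/K/V routing explicit. The sequence lengths, model width and random weight matrices here are assumptions for illustration; a real model would use several heads, learned weights and an output projection.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    """Single-head cross-attention: Q from the decoder, K and V from the encoder."""
    q = decoder_states @ w_q                      # (tgt_len, d_model)
    k = encoder_outputs @ w_k                     # (src_len, d_model)
    v = encoder_outputs @ w_v                     # (src_len, d_model)
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (tgt_len, src_len)
    weights = softmax(scores)                     # each target row sums to 1 over source positions
    return weights @ v, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 5, 3, 8               # hypothetical sizes
encoder_outputs = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
```

The weight matrix has one row per target position and one column per source position, so row `t` is the distribution over source tokens used when generating target token `t`. No causal mask is needed here because the whole source sequence is already available.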

Theory

Self-attention mixes positions inside one sequence; cross-attention lets one sequence attend to representations from another sequence.

Common mistakes

  • Claiming that attention alone knows token order, without positional information.
  • Mixing up self-attention and encoder-decoder cross-attention.
  • Describing Q/K/V as fixed inputs rather than learned projections of hidden states.

How to answer in an interview

  • Draw the encoder and decoder separately.
  • For cross-attention, say explicitly: Q from decoder, K/V from encoder.