Назад к подготовке

Transformer attention, токенизация и cross-attention

Transformer attention, токенизация и cross-attention

Ответить самому

Сначала сформулируйте ответ как на собеседовании, затем откройте разбор и оцените себя.

Загрузка

Короткий ответ

Tokens are embedded, positional information is added or injected, self-attention mixes contextual information, masked decoder attention prevents future leakage, and cross-attention lets decoder queries attend to encoder keys and values.

Полный разбор

A Transformer starts with tokenization, often subword tokenization such as BPE or WordPiece. Token ids are mapped to embeddings. Because attention alone is permutation-invariant over positions, the model needs positional information, either added as sinusoidal or learned embeddings or injected through rotary position embeddings.

In self-attention, each token produces query, key and value vectors. Query-key dot products score relevance, scaling stabilizes logits, softmax turns scores into weights, and the output is a weighted sum of values. Multi-head attention repeats this in several learned subspaces, then the block applies residual connections, normalization and a feed-forward network.

In an encoder-decoder Transformer, the encoder builds contextual representations of the source sequence. The decoder uses masked self-attention so position t cannot see future tokens. Cross-attention then uses decoder states as queries and encoder states as keys and values, letting generation condition on the source sequence. GPT-style models are decoder-only; BERT-style models are encoder-only.

Теория

Attention is a content-based mixing mechanism; the architecture around it defines what information is allowed to flow where.

Типичные ошибки

  • Forget positional information.
  • Mix up query/key/value sources in cross-attention.
  • Describe decoder self-attention without causal masking.
  • Say BERT and GPT differ only by training data.

Как отвечать на собеседовании

  • Use one sentence each for tokenization, attention and masking.
  • For cross-attention, say decoder queries attend to encoder keys and values.