Transformer attention, tokenization, and cross-attention
Answer it yourself
First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.
Short answer
Tokens are embedded, positional information is added or injected, self-attention mixes contextual information, masked decoder attention prevents future leakage, and cross-attention lets decoder queries attend to encoder keys and values.
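Full walkthrough
A Transformer starts with tokenization, often subword tokenization such as BPE or WordPiece. Token ids are mapped to embedding vectors. Self-attention by itself is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way, so the layer has no built-in notion of order. The model therefore needs positional information, either added to the embeddings (sinusoidal or learned position embeddings) or injected inside attention, as rotary position embeddings (RoPE) do by rotating queries and keys.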
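The input pipeline above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a real tokenizer: the vocabulary `VOCAB`, the embedding table `TABLE`, and the function names are hypothetical, the split is a greedy longest-match in the spirit of WordPiece, and the position encoding is the sinusoidal variant.

```python
import math

# Hypothetical toy vocabulary; real BPE/WordPiece vocabularies are learned from data.
VOCAB = {"[UNK]": 0, "trans": 1, "form": 2, "er": 3, "s": 4, "at": 5, "tend": 6}


def tokenize(word, vocab=VOCAB):
    """Greedy longest-match subword split (WordPiece-like, without '##' markers)."""
    ids, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece starts here
            return [vocab["[UNK]"]]
        ids.append(vocab[word[start:end]])
        start = end
    return ids


def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(ids, table):
    """Look up each token embedding and add its positional encoding."""
    d_model = len(table[0])
    return [
        [e + p for e, p in zip(table[tok], sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(ids)
    ]


# Hypothetical 4-dimensional embedding table (fixed values instead of learned ones).
TABLE = [[0.1 * (row + col) for col in range(4)] for row in range(len(VOCAB))]

ids = tokenize("transformers")   # pieces: trans, form, er, s
vectors = embed(ids, TABLE)      # one position-aware vector per token
```

Because the position term depends on `pos`, the same token id yields different input vectors at different positions, which is what gives the otherwise order-blind attention layers access to order.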
In self-attention, each token produces query, key, and value vectors through learned linear projections. Query-key dot products score relevance, dividing by the square root of the key dimension keeps the logits from growing with dimension (so the softmax does not saturate), softmax turns scores into weights, and the output is a weighted sum of values. Multi-head attention repeats this in several learned subspaces and concatenates the results; the block then applies residual connections, normalization, and a feed-forward network.
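A single attention head can be sketched as follows. This is a simplified, dependency-free illustration: the helper names are hypothetical, the projection matrices are passed in rather than learned, and multi-head splitting, residual connections, normalization, and the feed-forward network are omitted.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Query-key dot products, scaled by sqrt(d_k).
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights          # weighted sum of values


# Hypothetical inputs: three 2-dimensional tokens and identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_ID = [[1.0, 0.0], [0.0, 1.0]]

out, weights = self_attention(X, W_ID, W_ID, W_ID)
```

Each row of `weights` says how much one token draws from every token in the sequence, itself included. Reordering the rows of `X` reorders the rows of `out` in the same way, which is why positional information has to come from somewhere else.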
In an encoder-decoder Transformer, the encoder builds contextual representations of the source sequence. The decoder uses masked (causal) self-attention so position t cannot see future tokens. Cross-attention then uses decoder states as queries and encoder states as keys and values, letting generation condition on the source sequence. GPT-style models are decoder-only: causal self-attention with no cross-attention. BERT-style models are encoder-only: bidirectional self-attention with no causal mask.
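The two decoder attention steps differ only in where queries, keys, and values come from and in whether a causal mask is applied. The sketch below makes that explicit under simplifying assumptions: the function and variable names are hypothetical, the learned projections are left out (states are used directly as queries, keys, and values), and the residual and normalization steps between the two sublayers are skipped.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Numerically stable softmax; masked (-inf) entries get weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention with an optional causal mask."""
    d_k = len(keys[0])
    weights = []
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Position t may only attend to positions j <= t.
            scores = [s if j <= t else NEG_INF for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    width = len(values[0])
    output = [
        [sum(w * v[c] for w, v in zip(w_row, values)) for c in range(width)]
        for w_row in weights
    ]
    return output, weights


# Hypothetical states: 4 source positions (encoder), 3 target positions (decoder).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
decoder_states = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8]]

# Masked self-attention: queries, keys, and values all come from the decoder.
dec_out, self_w = attention(decoder_states, decoder_states, decoder_states, causal=True)

# Cross-attention: decoder queries, encoder keys and values, no causal mask.
cross_out, cross_w = attention(dec_out, encoder_states, encoder_states)
```

Here `self_w` is a 3x3 lower-triangular matrix (the first target position can only attend to itself), while `cross_w` is 3x4: every target position can attend to every source position.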
Theory
Attention is a content-based mixing mechanism; the architecture around it defines what information is allowed to flow where.
Typical mistakes
- Forgetting positional information.
- Mixing up the query/key/value sources in cross-attention.
- Describing decoder self-attention without causal masking.
- Saying BERT and GPT differ only by training data.
How to answer in an interview
- Use one sentence each for tokenization, attention and masking.
- For cross-attention, say decoder queries attend to encoder keys and values.