Question
Explain the difference between BERT and GPT in terms of Transformer architecture and training objective.
Answer it yourself
First formulate your answer as you would in an interview, then open the walkthrough and grade yourself.
Short answer
BERT is an encoder-style bidirectional Transformer trained mainly with masked-language modeling for representation tasks. GPT is a decoder-only causal Transformer trained to predict the next token for generation.
Full walkthrough
BERT uses Transformer encoder blocks with bidirectional self-attention, so each token can attend to tokens on both sides during pretraining. Its classical objective is masked language modeling: hide some tokens and predict them from context. This makes BERT strong for classification, retrieval embeddings, NER and other understanding tasks after fine-tuning.
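To make the masked language modeling objective concrete, here is a minimal sketch in plain Python. It is a simplification, not BERT's actual preprocessing: the function name `mask_tokens`, the `[MASK]` string, the whitespace-level tokens and the 15% default are assumptions chosen for illustration, and the original BERT recipe also sometimes replaces a selected token with a random one or leaves it unchanged.

```python
import random

MASK = "[MASK]"  # placeholder symbol; assumed here for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified masked-language-modeling example.

    labels[i] holds the original token at hidden positions and None elsewhere,
    so a loss would be computed only where a token was hidden.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask symbol...
            labels.append(tok)    # ...and must recover the original token
        else:
            inputs.append(tok)    # visible context, from either side
            labels.append(None)   # no prediction target at this position
    return inputs, labels


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(mask_tokens(sentence, mask_prob=0.3))
```

The point to notice is that a hidden position still has visible neighbours on both its left and its right, which is exactly what bidirectional attention lets the model exploit.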
GPT-style models use decoder-only Transformer blocks with causal attention: each token can attend only to previous tokens. The objective is next-token prediction. This aligns directly with text/code generation and chat completion, so modern LLMs are mostly GPT-like decoder-only architectures.
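The next-token objective can be illustrated by listing which context each prediction is allowed to use. This is a toy sketch: the helper name `next_token_examples` is hypothetical, and real models work on subword token IDs and predict all positions in parallel under a causal mask rather than building explicit pairs.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs for next-token prediction.

    Each target is predicted only from the tokens that precede it,
    which is what the causal mask enforces inside the model.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    for context, target in next_token_examples(["the", "cat", "sat"]):
        print(context, "->", target)
```

Because training and generation use the same left-to-right setup, sampling a continuation is just repeated application of the trained objective.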
Both use attention, residual connections, normalization and feed-forward layers, but the masking pattern and training objective drive their behavior. In an interview, state encoder/bidirectional/MLM for BERT and decoder-only/causal/next-token for GPT early.
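A small sketch can show that the attention computation itself is shared and only the mask differs. The function names and the raw score matrix below are assumptions for illustration; a real Transformer derives the scores from query and key projections and usually applies the mask by adding a large negative value before the softmax.

```python
import math


def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with zero weight on disallowed positions."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)  # for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, causal):
    """Row-wise attention weights under a bidirectional or a causal mask.

    causal=False: position i may attend to every position (BERT-style).
    causal=True:  position i may attend only to positions j <= i (GPT-style).
    """
    n = len(scores)
    return [
        masked_softmax(scores[i], [(j <= i) or not causal for j in range(n)])
        for i in range(n)
    ]


if __name__ == "__main__":
    uniform = [[0.0] * 3 for _ in range(3)]
    print(attention_weights(uniform, causal=False))
    print(attention_weights(uniform, causal=True))
```

With identical scores, the bidirectional version spreads weight over all positions, while the causal version gives the first token weight only on itself and lets later tokens look back but never ahead.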
Common mistakes
- Say only that BERT is a poor fit for generation-like tasks and never mention its main role as a representation model.
- Describe GPT as encoder-decoder by default.
- Forget the causal mask in GPT.