Question

Explain the difference between BERT and GPT in terms of Transformer architecture and training objective.

Answer it yourself

First formulate your answer as you would in an interview, then open the explanation and grade yourself.


BERT is an encoder-style bidirectional Transformer trained mainly with masked-language modeling for representation tasks. GPT is a decoder-only causal Transformer trained to predict the next token for generation.

Full explanation

BERT uses Transformer encoder blocks with bidirectional self-attention, so each token can attend to tokens on both its left and its right during pretraining. Its classic objective is masked language modeling (MLM): hide a fraction of the input tokens and predict them from the surrounding context. After fine-tuning, this makes BERT strong on classification, retrieval embeddings, NER, and other understanding tasks.

GPT-style models use decoder-only Transformer blocks with causal attention: each token can attend only to itself and to earlier tokens. The objective is next-token prediction. This matches text generation, code generation, and chat completion directly, which is why most modern LLMs are GPT-like decoder-only architectures.

Both families use attention, residual connections, layer normalization, and feed-forward layers; the masking pattern and the training objective are what drive the difference in behavior. In an interview, state "encoder, bidirectional, MLM" for BERT and "decoder-only, causal, next-token" for GPT early in your answer.
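The two differences described above, the attention mask and the training targets, can be sketched in a few lines of plain Python. This is an illustrative toy, not how either model is actually implemented: the function names (`attention_mask`, `mlm_example`, `next_token_example`) are hypothetical, tokens are whole words rather than subword IDs, and the masking is simplified (the original BERT recipe selects about 15% of tokens and replaces only most of them with `[MASK]`, swapping some for random tokens or leaving them unchanged).

```python
import random


def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """BERT-style pair: hide random tokens; only hidden positions carry a label."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)       # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)      # no loss is computed at this position
    return inputs, labels


def next_token_example(tokens):
    """GPT-style pair: every position is trained to predict the token after it."""
    return tokens[:-1], tokens[1:]


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Encoder (BERT): every position sees every other position.
print(attention_mask(3, causal=False))
# Decoder-only (GPT): lower-triangular, no access to later positions.
print(attention_mask(3, causal=True))

print(mlm_example(tokens, mask_prob=0.3))
print(next_token_example(tokens))
```

The MLM pair gets a training signal only from the hidden positions but conditions on both sides of each gap. The next-token pair gets a signal from every position but conditions only on the left context, which is exactly the condition the model faces when it generates text.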

Common mistakes

Claiming that BERT cannot be fine-tuned for generation-like tasks while ignoring its main role as a representation model.
Describing GPT as an encoder-decoder model by default.
Forgetting the causal mask in GPT.
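To make the point about the causal mask concrete, here is a minimal sketch of single-head self-attention in plain Python. It assumes, purely for illustration, that queries, keys, and values are the raw input vectors (no learned projections), and the helper name `self_attention` is hypothetical. Two sequences differ only in their last token. With the causal mask the outputs at earlier positions are identical; without it, the future token leaks into them.

```python
import math


def self_attention(x, causal):
    """Single-head attention with Q = K = V = x (no learned weights)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Causal: position i sees positions 0..i. Bidirectional: it sees all.
        visible = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in visible]
        peak = max(scores)                       # subtract max for stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * x[j][k] for w, j in zip(weights, visible))
                    for k in range(d)])
    return out


seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]    # only the last token differs

causal_a = self_attention(seq_a, causal=True)
causal_b = self_attention(seq_b, causal=True)
full_a = self_attention(seq_a, causal=False)
full_b = self_attention(seq_b, causal=False)

prefix_unchanged_causal = causal_a[:2] == causal_b[:2]
prefix_unchanged_full = full_a[:2] == full_b[:2]

print(prefix_unchanged_causal)  # True: earlier positions never saw the last token
print(prefix_unchanged_full)    # False: the last token changed earlier outputs
```

This is why next-token training depends on the mask: without it, the prediction at each position could simply read the token it is supposed to predict.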