LLM Scaling and Architecture
Dense Transformers, MoE, long context, data/scale trade-offs, and inference-aware architecture decisions, drawn from public technical reports.
What the candidate should be able to do
- Explain dense vs MoE trade-offs without assuming MoE is always cheaper or better.
- Relate context length, KV cache, model size, and serving memory (see the sketch after this list).
- Understand activated parameters vs total parameters as an operational distinction.
- Read LLM technical reports critically: architecture, training data, post-training, evaluation, and serving implications.
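The KV-cache point above is easiest to internalize with numbers. Below is a minimal sizing sketch; the layer count, KV-head count, and head dimension match the publicly reported Llama 3 70B configuration, while batch size and dtype are assumptions chosen for illustration.

```python
# Minimal KV-cache sizing sketch. Layer count, KV heads, and head dim
# below match the publicly reported Llama 3 70B config; batch size and
# dtype are illustrative assumptions.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:  # 2 = fp16/bf16
    # Each layer stores one K and one V vector per KV head per token.
    per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_per_elem
    return per_token * context_len * batch_size

# Llama 3 70B shape (80 layers, 8 GQA KV heads, head dim 128), 128K context:
gib = kv_cache_bytes(80, 8, 128, 128_000, batch_size=1) / 2**30
print(f"KV cache: ~{gib:.1f} GiB per sequence")  # ~39 GiB at full context
```

Note what the arithmetic shows: KV-cache memory grows linearly in context length and batch size, is independent of total parameter count, and under GQA scales with KV heads rather than attention heads. Long-context serving pressure is therefore an architecture decision, not just a deployment detail.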
What interviewers ask
- Why does one team choose a dense Transformer and another MoE?
- What do activated parameters per token change operationally? (See the sketch after this list.)
- How does long context affect latency and memory?
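For the activated-parameters question, a back-of-the-envelope sketch with a hypothetical top-k MoE FFN; all shapes here are invented for illustration and do not come from any specific report.

```python
# Hypothetical top-k MoE FFN layer; all shapes are invented for
# illustration and do not come from any specific report.
d_model, d_ff = 4096, 14336
num_experts, top_k = 64, 2

# Two projection matrices per expert (ignoring a gated/SwiGLU third
# matrix and the router, which are small by comparison).
params_per_expert = 2 * d_model * d_ff
total_ffn = num_experts * params_per_expert  # must live in serving memory
activated_ffn = top_k * params_per_expert    # drives per-token FLOPs

print(f"total FFN params per layer:     {total_ffn / 1e9:.2f}B")
print(f"activated FFN params per token: {activated_ffn / 1e9:.2f}B")
```

The operational answer: per-token FLOPs track activated parameters, while serving memory, checkpoint size, and expert-parallel communication track total parameters, so MoE is not automatically cheaper end to end.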
Practical task
Compare Llama 3, DeepSeek-V3, and Qwen2.5 using their public reports: architecture, data scale, context length, post-training, and serving implications. A starting scaffold is sketched below.
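A possible scaffold for the exercise, a sketch rather than an answer key. The example values are figures commonly cited from the respective public reports (e.g., DeepSeek-V3's 37B activated of 671B total); verify each number against the papers, and treat the one-line post_training summaries as heavy simplifications.

```python
from dataclasses import dataclass

@dataclass
class ReportSummary:
    name: str
    architecture: str          # "dense" or "MoE"
    total_params_b: float
    activated_params_b: float  # equals total for dense models
    pretrain_tokens_t: float
    max_context: int
    post_training: str

# Values below are commonly cited from the public reports; verify
# against the papers before relying on them.
models = [
    ReportSummary("Llama 3.1 405B", "dense", 405, 405, 15.6, 128_000,
                  "SFT + rejection sampling + DPO"),
    ReportSummary("DeepSeek-V3", "MoE", 671, 37, 14.8, 128_000,
                  "SFT + RL (GRPO)"),
    ReportSummary("Qwen2.5-72B", "dense", 72, 72, 18.0, 128_000,
                  "SFT + DPO + GRPO"),
]

for m in models:
    print(f"{m.name}: {m.architecture}, {m.activated_params_b}B activated "
          f"of {m.total_params_b}B total, {m.pretrain_tokens_t}T tokens, "
          f"{m.max_context // 1000}K context")
```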
Source-grounded rule
Use public reports as illustrative examples, not as fully reproducible recipes; many training details remain undisclosed, incomplete, or workload-specific.