How LoRA fine-tuning works
Answer it yourself
First formulate your answer as you would in an interview, then open the breakdown and assess yourself.
Short answer
LoRA freezes the base model and learns small low-rank matrices whose product is added to the weights of selected linear layers. Only the adapter weights get gradients, so gradient and optimizer-state memory are much smaller.
Full breakdown
A dense linear layer has a weight matrix W. Full fine-tuning updates W directly, which is expensive for LLMs because gradients and optimizer state must be stored for every parameter.
LoRA instead freezes W and learns a low-rank update Delta W = B A. For W of shape d_out x d_in, B has shape d_out x r and A has shape r x d_in, with the rank r much smaller than d_in and d_out. In the forward pass, the layer computes W x plus the adapter contribution B A x, usually scaled by alpha / r, where alpha is the LoRA alpha hyperparameter. Common target layers are the attention projections and sometimes the MLP projections.
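The forward pass described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the toy sizes, the helper names `matvec` and `lora_forward`, and the initialization (small random A, zero B, a common convention so that the adapter starts as a no-op) are not taken from the text above.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without ever forming B A."""
    base = matvec(W, x)      # frozen path, length d_out
    low = matvec(A, x)       # project down to r dimensions
    delta = matvec(B, low)   # project back up to d_out
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4.0  # toy sizes, for illustration only

# W is frozen; only A (r x d_in) and B (d_out x r) would be trained.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # assumed zero init: Delta W starts at 0

x = [random.gauss(0, 1) for _ in range(d_in)]
y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha, r)

# With B still at zero, the adapted layer reproduces the base layer.
print(y_lora == y_base)
```

Computing B (A x) instead of (B A) x keeps the extra work proportional to r * (d_in + d_out) rather than d_in * d_out.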
Because only A and B are trainable, the memory needed for gradients and optimizer state is much lower; the forward pass still runs through the full frozen model. At deployment time, adapters can be kept separate and swapped per task or tenant, or merged into the base weights (W + Delta W) for simpler inference. The trade-off is that LoRA's capacity depends on rank, target modules, and data quality; it does not replace full fine-tuning in every case.
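To make the memory claim concrete, here is a rough back-of-the-envelope calculation for a single layer. The layer size (4096 x 4096), the rank (8), and the choice of Adam with fp32 state are hypothetical values picked for illustration; the estimate ignores activations and the frozen weights themselves.

```python
def full_params(d_in, d_out):
    """Trainable parameters when W itself is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, r):
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

d_in = d_out = 4096  # hypothetical projection size
r = 8                # hypothetical LoRA rank

full = full_params(d_in, d_out)     # 16_777_216
lora = lora_params(d_in, d_out, r)  # 65_536
ratio = lora / full                 # 0.00390625, about 0.39%

# Assumption: Adam keeps two moment estimates per trainable parameter, in fp32.
ADAM_STATES = 2
BYTES_FP32 = 4
full_opt_bytes = full * ADAM_STATES * BYTES_FP32  # 134_217_728 bytes = 128 MiB
lora_opt_bytes = lora * ADAM_STATES * BYTES_FP32  # 524_288 bytes = 512 KiB

print(f"trainable params: {full:,} vs {lora:,} ({ratio:.2%})")
print(f"optimizer state: {full_opt_bytes / 2**20:.0f} MiB vs {lora_opt_bytes / 2**10:.0f} KiB")
```

The same ratio applies to gradient storage, since gradients are only kept for trainable parameters.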
Theory
LoRA is a parameter-efficient fine-tuning method that learns a low-rank delta on top of frozen base weights.
Common mistakes
- Saying that LoRA trains a separate small model unrelated to the base model.
- Forgetting that the base weights are usually frozen.
- Assuming LoRA never has an inference cost; adapters kept separate can add operational complexity.
How to answer in an interview
- Use the formula Delta W = B A to make the answer concrete.
- Mention adapter swapping or merging if the role involves multi-tenant model serving.
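If the conversation turns to serving, a small sketch can show that swapping and merging give the same outputs. Everything here is hypothetical: the tenant names, the tiny matrices, and the helpers `forward_unmerged` and `merge` are made up for illustration, with values chosen so the arithmetic is exact.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def forward_unmerged(W, adapter, x, alpha, r):
    """Swappable path: shared frozen W plus a per-tenant low-rank correction."""
    scale = alpha / r
    delta = matvec(adapter["B"], matvec(adapter["A"], x))
    return [b + scale * d for b, d in zip(matvec(W, x), delta)]

def merge(W, adapter, alpha, r):
    """Merged path: return W + (alpha / r) * B A as a new matrix; W is untouched."""
    scale = alpha / r
    BA = matmul(adapter["B"], adapter["A"])
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

r, alpha = 1, 2.0
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # shared frozen weights, d_out=2, d_in=3
adapters = {                             # hypothetical per-tenant adapters
    "tenant_a": {"A": [[1.0, 0.0, 1.0]], "B": [[1.0], [0.0]]},
    "tenant_b": {"A": [[0.0, 1.0, 0.0]], "B": [[0.0], [-1.0]]},
}
x = [1.0, 2.0, 3.0]

results = {}
for name, adapter in adapters.items():
    y_swap = forward_unmerged(W, adapter, x, alpha, r)
    y_merged = matvec(merge(W, adapter, alpha, r), x)
    results[name] = (y_swap, y_merged)
    print(name, y_swap, y_merged)
# tenant_a [15.0, -1.0] [15.0, -1.0]
# tenant_b [7.0, -5.0] [7.0, -5.0]
```

Keeping adapters separate lets one copy of W serve many tenants at the price of extra small matrix products per request; merging removes that overhead but yields one full-size weight matrix per adapter.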