
Parameter-efficient fine-tuning for LLMs - LoRA

felixasanger4

How is it possible to fine-tune Large Language Models on a single GPU, or only a few (depending on the actual size of the model), when they are pre-trained on thousands of A100 GPUs?


A fundamental concept to achieve this was presented in the paper


๐—Ÿ๐—ผ๐—ฅ๐—”: ๐—Ÿ๐—ผ๐˜„-๐—ฅ๐—ฎ๐—ป๐—ธ ๐—”๐—ฑ๐—ฎ๐—ฝ๐˜๐—ถ๐—ผ๐—ป ๐—ผ๐—ณ ๐—Ÿ๐—ฎ๐—ฟ๐—ด๐—ฒ ๐—Ÿ๐—ฎ๐—ป๐—ด๐˜‚๐—ฎ๐—ด๐—ฒ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ (Link).ย 


The concept is quite simple. Instead of fine-tuning the original parameters of the model, you freeze them and add adapters to the model weights. During backpropagation, only the parameters of the adapters get updated.


Instead of computing x' = W*x, we compute x' = W*x + A*x, keeping the original model weights W frozen and only optimising the adapter weights A.
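A minimal PyTorch sketch of this idea (the class name LoRALinear, the rank r and the initialisation are illustrative choices, not the paper's reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen pre-trained weight W and a trainable
    low-rank adapter, so the forward pass computes x' = W*x + A*x."""

    def __init__(self, base_linear: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base_linear
        # Freeze the original weights W (and bias): no gradients, no optimizer state.
        for p in self.base.parameters():
            p.requires_grad = False

        d_out, d_in = base_linear.out_features, base_linear.in_features
        # The adapter A is factored into two small rank-r matrices,
        # which is why it has far fewer trainable parameters than W.
        # (The paper also scales the adapter output by alpha/r; omitted here for brevity.)
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # d_out x r, zero-init

    def forward(self, x):
        # x' = W*x + A*x, where A = lora_B @ lora_A
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T
```

Because lora_B is initialised to zero, the adapter contributes nothing at the start of training, so the wrapped layer initially behaves exactly like the pre-trained one.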


See the attached image for a visual explanation of why A contains far fewer trainable parameters than W: the adapter is stored as a product of two low-rank matrices, so instead of a full d x d weight you only train a d x r and an r x d matrix with a small rank r.
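To make the difference concrete, a quick back-of-the-envelope comparison (d = 4096 and r = 8 are example values, not taken from the post):

```python
d, r = 4096, 8              # example hidden size and adapter rank
full = d * d                # parameters in the frozen weight W: ~16.8M
adapter = 2 * d * r         # parameters in the low-rank adapter: ~65.5k
print(f"W: {full:,}  adapter: {adapter:,}  ratio: {full / adapter:.0f}x")
```

For these numbers the adapter is roughly 256 times smaller than the weight it adapts.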


Therefore, during fine-tuning we only have to keep optimiser state for the adapter weights in memory, which significantly reduces the GPU memory requirements.
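In practice this just means handing only the still-trainable parameters to the optimiser. A sketch, reusing the hypothetical LoRALinear wrapper from above (Adam and the learning rate are illustrative choices):

```python
import torch
import torch.nn as nn

# Toy "model": a pre-trained linear layer wrapped with a LoRA adapter
# (LoRALinear is the sketch from above; d = 4096 is an example size).
model = LoRALinear(nn.Linear(4096, 4096), r=8)

# Only parameters that still require gradients (the adapter matrices) are
# passed to the optimiser, so Adam's per-parameter state (momentum and
# variance tensors) is kept only for the adapters, not for the frozen W.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```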


Of course, LoRA is only one element in the big picture of making LLMs fine-tuneable on comparably small GPU clusters. Another important milestone is the QLoRA paper, which I will go into in more detail in my next post. It combines quantization with low-rank adapters.




