How is it possible to fine-tune Large Language Models on a single GPU or just a few GPUs (depending on the actual size of the model), when they are pre-trained on thousands of A100 GPUs?
A very fundamental concept to achieve this was presented in the paper
𝗟𝗼𝗥𝗔: 𝗟𝗼𝘄-𝗥𝗮𝗻𝗸 𝗔𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻 𝗼𝗳 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 (Link).
The concept is quite simple. Instead of fine-tuning the original parameters of the model, you freeze them and add 𝗮𝗱𝗮𝗽𝘁𝗲𝗿𝘀 to your model weights. During backpropagation, only the parameters of the 𝗮𝗱𝗮𝗽𝘁𝗲𝗿𝘀 get updated.
Instead of computing h' = W*x we compute h' = W*x + A*x, while keeping the original model weights W frozen and only optimising the adapter weights A.
See the attached image for a visual explanation of why A contains far fewer trainable parameters than W: the adapter is factored into two low-rank matrices, so its parameter count scales with the chosen rank rather than with the full weight dimensions.
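To make this concrete, here is a minimal PyTorch sketch of the idea. It is only an illustration, not the implementation from the paper or from a library such as peft, and the names (LoRALinear, lora_A, lora_B, r, alpha) are my own:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank adapter (illustrative sketch)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Original weights W: kept frozen, never updated during fine-tuning.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False

        # Adapter: A (r x in) and B (out x r) are the only trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init -> adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h' = W*x + (B*A)*x  -- the frozen path plus the low-rank adapter path
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

For a 4096 x 4096 weight matrix and rank r = 8, W holds roughly 16.8M parameters while A and B together add only about 65k trainable ones.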
Therefore, during fine-tuning we only have to keep optimiser state for the adapter weights in memory, which significantly reduces the GPU memory requirements.
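As a rough illustration of that memory saving (again using the hypothetical LoRALinear sketched above), only the adapter parameters are handed to the optimiser, so Adam-style state is kept for them alone:

```python
# Hypothetical two-layer model built from the LoRALinear sketch above.
model = nn.Sequential(LoRALinear(4096, 4096), nn.ReLU(), LoRALinear(4096, 4096))

# Pass only the trainable adapter parameters to the optimiser: Adam-style optimisers
# keep extra state tensors per optimised parameter, so the frozen W never costs that memory.
adapter_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(adapter_params, lr=1e-4)
```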
Of course, 𝗟𝗼𝗥𝗔 is only one element in the big picture of making LLMs fine-tunable on comparably small GPU clusters. Another important milestone is the 𝗤𝗟𝗼𝗥𝗔 paper, which combines quantization with low-rank adapters and which I will cover in more detail in my next post.
