How is it possible to fine-tune Large Language Models on a single GPU or just a few GPUs (depending on the actual size of the model), when they are pre-trained on thousands of A100 GPUs?
A very fundamental concept to achieve this was presented in the paper
𝗟𝗼𝗥𝗔: 𝗟𝗼𝘄-𝗥𝗮𝗻𝗸 𝗔𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻 𝗼𝗳 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 (Link).
The concept is quite simple. Instead of fine-tuning the original parameters of the model, you freeze them and add 𝗮𝗱𝗮𝗽𝘁𝗲𝗿𝘀 to your model weights. During backpropagation, only the parameters of the 𝗮𝗱𝗮𝗽𝘁𝗲𝗿𝘀 get updated.
Instead of computing h' = W*x we compute h' = W*x + A*x, while keeping the original model weights W frozen and only optimising the adapter weights A.
See the attached image for a visual explanation of why A contains far fewer trainable parameters than W: the adapter is factored into two low-rank matrices, so its parameter count scales with the chosen rank rather than with the full weight dimensions.
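To make this concrete, here is a minimal PyTorch sketch of the idea. It is only an illustration, not the implementation from the paper or from a library such as peft, and the names (LoRALinear, lora_A, lora_B, r, alpha) are my own:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank adapter (illustrative sketch)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Original weights W: kept frozen, never updated during fine-tuning.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False

        # Adapter: A (r x in) and B (out x r) are the only trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init -> adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h' = W*x + (B*A)*x  -- the frozen path plus the low-rank adapter path
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

For a 4096 x 4096 weight matrix and rank r = 8, W holds roughly 16.8M parameters while A and B together add only about 65k trainable ones.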
Therefore, during fine-tuning we only have to keep optimiser state for the adapter weights in memory, which significantly reduces the GPU memory requirements.
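As a rough illustration of that memory saving (again using the hypothetical LoRALinear sketched above), only the adapter parameters are handed to the optimiser, so Adam-style state is kept for them alone:

```python
# Hypothetical two-layer model built from the LoRALinear sketch above.
model = nn.Sequential(LoRALinear(4096, 4096), nn.ReLU(), LoRALinear(4096, 4096))

# Pass only the trainable adapter parameters to the optimiser: Adam-style optimisers
# keep extra state tensors per optimised parameter, so the frozen W never costs that memory.
adapter_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(adapter_params, lr=1e-4)
```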
Of course, 𝗟𝗼𝗥𝗔 is only one element in the big picture of making LLMs fine-tunable on comparably small GPU clusters. Another important milestone is the 𝗤𝗟𝗼𝗥𝗔 paper, which combines quantization with low-rank adapters and which I will cover in more detail in my next post.
