![](https://crypto4nerd.com/wp-content/uploads/2024/04/0xLG_5PkUpESLZ3jv.jpg)
Parameter-Efficient Fine-Tuning
LoRA [Ref] freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer, greatly reducing the number of trainable parameters for fine-tuning. Full fine-tuning is extremely expensive or infeasible for a large language model with 175B parameters, as it involves gradient updates for all of the parameters. LoRA aims to drastically reduce the number of updated parameters to a few million without a significant drop in performance.
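To get a feel for the savings, here is a back-of-the-envelope count for a single weight matrix (the layer size and rank below are assumed for illustration, not figures from the paper):

```python
# Back-of-the-envelope parameter count for one weight matrix (assumed sizes).
d, k, r = 4096, 4096, 8        # illustrative hidden size and LoRA rank
full_update = d * k            # 16,777,216 parameters updated by full fine-tuning
lora_update = d * r + r * k    # 65,536 parameters updated by LoRA (factors B and A)
print(full_update, lora_update, full_update / lora_update)  # reduction of 256x per matrix
```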
Intrinsic dimension is the minimum number of parameters required to reach performance comparable to full fine-tuning on a given objective function. The paper shows that tuning roughly 200 parameters of RoBERTa achieves 90% of the performance reached by fully fine-tuning RoBERTa. The paper [Ref] empirically proposes that:
- common NLP tasks within the context of pre-trained representations have an intrinsic dimension several orders of magnitude smaller than the full parameterization.
- the process of pre-training implicitly optimizes the description length over the average of NLP tasks, without having direct access to those same tasks.
- there exists a fortuitous trend where larger models tend to have a smaller intrinsic dimension.
In short, that paper proposes that pre-trained language models have a low intrinsic dimension. Inspired by this, LoRA hypothesizes that the weight updates during adaptation also have a low intrinsic rank.
For a pre-trained weight matrix W0 ∈ R^(d×k), we constrain its update by representing the latter with a low-rank decomposition W0 + ∆W = W0 + BA, where B ∈ R^(d×r), A ∈ R^(r×k), and the rank r ≪ min(d, k). Any matrix of rank r can be written as the product of two such matrices [Ref].
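The decomposition can be made concrete with a small PyTorch sketch (an illustrative implementation, not the code from the paper). The frozen weight W0 stays untouched, while only the two small factors B and A receive gradients; the zero initialization of B and the α/r scaling follow the paper's setup, and the dimensions are placeholders:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-augmented linear layer: y = x (W0 + (alpha/r) * B A)^T."""

    def __init__(self, d: int, k: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W0 ∈ R^(d×k); in practice loaded from a
        # checkpoint, randomly filled here only to keep the sketch self-contained.
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # Trainable low-rank factors: B ∈ R^(d×r), A ∈ R^(r×k).
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so ∆W = BA = 0 at the start
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # small random init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only B and A are updated by the optimizer; W0 never changes.
        W = self.W0 + self.scaling * (self.B @ self.A)
        return x @ W.t()
```

With d = k = 4096 and r = 8, this layer trains 65,536 parameters instead of roughly 16.8 million.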
The LoRA paper concludes with:
- A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing only the matrices A and B, reducing the storage requirement and task-switching overhead significantly (see the sketch after this list).
- LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, we only optimize the injected, much smaller low-rank matrices.
- Its simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
- LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning.
- It is preferable to adapt more weight matrices than adapting a single type of weights with a larger rank.
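The merging and task-switching points can be illustrated with a short sketch (assumed shapes and made-up task names, not an API from the paper or any library): the base weight is stored once, each task contributes only a tiny (B, A) pair, and deployment collapses everything into a single matrix so inference is one ordinary matmul.

```python
import torch

d, k, r = 1024, 1024, 8
W0 = torch.randn(d, k)  # frozen pre-trained weight (would be loaded from a checkpoint in practice)

# Two task-specific LoRA adapters; each is just a small (B, A) pair.
adapters = {
    "task_summarize": (torch.zeros(d, r), torch.randn(r, k) * 0.01),
    "task_translate": (torch.zeros(d, r), torch.randn(r, k) * 0.01),
}

def merged_weight(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor,
                  alpha: float = 16.0) -> torch.Tensor:
    # W = W0 + (alpha/r) * B A; inference then uses a single matmul,
    # so there is no extra latency compared to a fully fine-tuned layer.
    r = B.shape[1]
    return W0 + (alpha / r) * (B @ A)

# Deploy for one task: merge once, then serve with the merged weight.
W_summarize = merged_weight(W0, *adapters["task_summarize"])

# Switch tasks: W0 is never modified, so we simply re-merge with the other
# adapter -- only the tiny B and A tensors differ between tasks.
W_translate = merged_weight(W0, *adapters["task_translate"])
```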