
Michał Jóźwiak, October 10, 2024

Efficient LLM fine-tuning with PEFT

In a previous article, we discussed how to use Reinforcement Learning from Human Feedback (RLHF) to align Large Language Models (LLMs) with our preferences. We also briefly discussed the crucial initial step of Supervised Fine-Tuning (SFT). These two techniques are commonly used together to tune modern LLMs like GPT or Llama models.

SFT is typically used first to teach the pre-trained model the “skills” we care about, for example:

  • Downstream (“concrete”) tasks: text summarization, math/science reasoning, function calling, etc.
  • Response behavior: act like a chatbot, adopt a different persona or writing style, provide recommendations, etc.
  • Content moderation: prevent the model from giving controversial answers, enforce platform-specific rules, identify toxic content, etc.

After SFT, we need to further align the model responses with our preferences. Human preferences are often hard to specify as a machine-readable objective. RLHF addresses that by using a reward model to score LLM responses just as a human would. This model is then used in a reinforcement learning loop. You can read more about RLHF in our previous blog.

Today, we explore the data and computational requirements of fine-tuning techniques and how you can leverage Labelbox and Parameter-Efficient Fine-Tuning (PEFT) to meet them.

Need for high-quality data

Both SFT and RLHF require appropriate data. Since SFT uses the same next-token prediction objective as pre-training, you need a dataset consisting of prompt-response pairs.

Training a reward model for RLHF, on the other hand, requires a dataset where multiple LLM responses to the same prompt are ranked or rated. For best results, you might need to create your own datasets, tailored to your needs. Unfortunately, manually creating datasets of sufficient size is a tedious process that can be a roadblock for many AI teams.
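To make the two data requirements concrete, here is a minimal sketch of what individual records might look like. The field names are illustrative, not a fixed schema:

```python
# Illustrative record formats (field names are hypothetical).

# SFT: one prompt paired with one high-quality reference response.
sft_example = {
    "prompt": "Summarize the following article:\n<article text>",
    "response": "<a concise, human-written summary>",
}

# Reward modeling: two responses to the same prompt, ranked by human raters.
preference_example = {
    "prompt": "Summarize the following article:\n<article text>",
    "chosen": "<the response raters preferred>",
    "rejected": "<the response raters ranked lower>",
}
```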

The Labelbox platform is designed to make that easier. As an industry-leading data factory, it can help you with every step of creating bespoke datasets—from processing and storing your assets, to creating custom workflows and managing the labeling workforce, to generating sophisticated ratings and labels through advanced tooling.  


Chat arena experience comparing three multimodal models in a live, multi-turn conversation.


Computational considerations

Modern LLMs contain a huge number of parameters. Meta’s latest open-source model, Llama 3.1, comes in three versions: 8, 70, and 405 billion parameters. If you use the smallest model in half precision (where each parameter takes two bytes of memory), you need approximately 16 GB of RAM and/or VRAM for inference. That’s not too bad, until you realize that fine-tuning might take around 4 times more memory, depending on the optimization algorithm used (typically more for RLHF than for SFT).

There is a useful tool available online where you can approximate your memory needs. Naturally, the computational overhead of fine-tuning is also much larger than that of inference.
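As a rough sanity check of the numbers above, you can also do the arithmetic yourself. The sketch below assumes half precision (2 bytes per parameter) and the approximate 4x full fine-tuning multiplier discussed above:

```python
def estimate_memory_gb(num_params: float,
                       bytes_per_param: int = 2,
                       finetune_multiplier: int = 4):
    """Back-of-the-envelope memory estimate; a rough approximation only."""
    inference_gb = num_params * bytes_per_param / 1e9
    # Full fine-tuning roughly adds gradients and optimizer states on top of the weights.
    finetune_gb = inference_gb * finetune_multiplier
    return inference_gb, finetune_gb

# Llama 3.1 8B in half precision:
inference, finetune = estimate_memory_gb(8e9)
print(f"inference: ~{inference:.0f} GB, full fine-tuning: ~{finetune:.0f} GB")
# -> inference: ~16 GB, full fine-tuning: ~64 GB
```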

How do we fine-tune the models without a hefty cloud computing bill? The answer is to limit the number of changed parameters to a minimum. That’s where Parameter-Efficient Fine-Tuning (PEFT) methods come into play.

Parameter-Efficient Fine-Tuning (PEFT)

Broadly speaking, we can identify three strategies to limit the number of trainable parameters:

  1. Additive methods: Add a small number of parameters and freeze the existing ones
  2. Selective methods: Select a small number of parameters from the model and freeze the rest
  3. Reparametrization-based methods: Represent changes to model parameters as a much smaller parameter space

Additive methods

Additive methods consist of adding a relatively small number of new parameters (typically less than 1% of the original number). While it sounds counterintuitive, only those new parameters are updated during fine-tuning, resulting in a big computational and memory efficiency gain.

The hope is that these new parameters are enough to encode task-specific knowledge.

Additive methods vary in how the new parameters are added to the model architecture:

  • Adapter-style methods [original paper] introduce new fully connected layers in each transformer block.
  • Other methods introduce parameters which you concatenate to the embedded input (Prompt tuning) or hidden states of every layer (Prefix-tuning, P-tuning). These belong to a subcategory called “soft prompts”, because they are inspired by in-context learning, where you concatenate “hard”/normal tokens to the prompt.
  • There are also additive methods which do not belong to the subcategories above. An example is Ladder Side-Tuning, which trains a small side network that reads intermediate activations from the big, frozen pre-trained network.

A drawback of additive methods is that they introduce computational overhead during model inference.
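To make the adapter idea from the list above concrete, here is a minimal PyTorch sketch of a bottleneck adapter: a small down-projection, a non-linearity, an up-projection, and a residual connection. The placement and dimensions are simplified assumptions, not the exact recipe from the original paper:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A small adapter block inserted into an otherwise frozen transformer layer."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # project down to a small dimension
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_size, hidden_size)    # project back up

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter only learns a small correction
        # on top of the frozen layer's output.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

Only the adapter's parameters are trained; everything else in the transformer stays frozen.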

Selective methods

Selective methods are all about selecting a subset of parameters from the model for training. BitFit only fine-tunes biases of linear or convolutional layers, which constitute less than 0.1% of all parameters. Other methods, like Diff Pruning, FishMask, and FAR, learn which parameters are most important during the initial passes of fine-tuning.
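As an illustration of the selective approach, a BitFit-style setup can be expressed in a few lines of PyTorch: freeze everything, then unfreeze only the bias terms. This is a simplified sketch, not the reference implementation:

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """Freeze all parameters except biases (BitFit-style selective fine-tuning)."""
    for name, param in model.named_parameters():
        # Only bias terms remain trainable; weights are frozen.
        param.requires_grad = name.endswith(".bias") or name == "bias"
```

After calling `apply_bitfit(model)`, only the bias parameters receive gradient updates during training.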

Reparametrization-based methods

Reparametrization-based methods represent updates to the original parameters in a very compact way, which leads to the drastic reduction of trained parameters. After fine-tuning, the updates are transformed/expanded and simply added to the original model.

The most widely used method from this family is LoRA. The idea behind it is surprisingly simple: for each parameter matrix (of size N×M) we want to fine-tune, we train an update matrix (also N×M), represented as the product of two much smaller matrices (N×R and R×M). The hyperparameter R denotes rank, a term from linear algebra. Intuitively, the higher the R, the more trainable parameters we have, which in principle means more capacity to learn. In practice, for large enough models, keeping R fairly low (even in the single digits, according to some sources) yields the same model accuracy as higher values.

Simplified visual representation of fine-tuning with LoRA. Only the blue matrices are updated. Matrices A and B are together smaller than the full weight update matrix ΔW (how much smaller depends on the hyperparameter R).
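The low-rank idea translates directly into code. Below is a minimal, self-contained sketch of a LoRA-style linear layer: the original weight W (N×M) is frozen, and only the two small matrices B (N×R) and A (R×M) are trained, with their product playing the role of ΔW. The initialization and scaling choices here are simplified assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False           # freeze the original W (N x M)
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        n, m = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, m) * 0.01)  # R x M
        self.B = nn.Parameter(torch.zeros(n, r))         # N x R, zero-init so ΔW starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.B @ self.A                        # ΔW = B A, rank at most R
        return self.base(x) + x @ (self.scaling * delta_w).T
```

Because the learned update can be merged back into W after training (W + scaling · B·A), LoRA adds no extra cost at inference time.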

Additional methods aim to improve the efficiency and/or quality of LoRA through various means:

  • KronA - uses a different matrix factorization (to reconstruct the update matrix, you apply the Kronecker product to the factor matrices instead of ordinary matrix multiplication), which leads to better learning capacity for the same R.
  • QLoRA - introduces sophisticated parameter quantization and clever RAM utilization to allow for fine-tuning even large LLMs on a single GPU (see the sketch below). It is, however, more computationally demanding.
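For QLoRA specifically, the usual recipe with Hugging Face tooling is to load the frozen base model in 4-bit precision and then attach LoRA adapters on top with the peft library (see the Libraries section below). The sketch shows the quantized-loading step; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the frozen base model with 4-bit NF4 quantization (the QLoRA setup).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters (kept in higher precision) are then attached on top,
# e.g. with peft's get_peft_model as shown in the Libraries section.
```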

Hybrid methods

A number of methods do not strictly fit into any of the above categories, but blend them. Some of these methods are: S4-model, Compacter, UniPELT.

Libraries

Hugging Face’s PEFT library is a convenient solution for applying PEFT methods to transformer models. It has first-class support in Hugging Face’s TRL library, which in turn provides a full LLM fine-tuning pipeline (from SFT to RLHF). You can find complete code examples here.
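To give a flavor of the API, the sketch below wraps a causal language model with a LoRA configuration using the peft library. The model id and hyperparameter values are placeholders, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder model id

# Describe which modules get low-rank adapters and how large they are.
lora_config = LoraConfig(
    r=8,                                   # rank of the update matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of all parameters

# Saving a PEFT model stores only the small adapter weights, not the full base model.
model.save_pretrained("llama-3.1-8b-lora-adapter")
```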

Conclusion

LLM fine-tuning requires a lot less data than pre-training, but the kind of data it needs is much harder to come by. Labelbox can help you create an appropriate high-quality dataset, both for SFT and for RLHF.

Modern LLMs tend to be so large that even the small variants require a lot of memory and computation to fine-tune. Fortunately, you can leverage PEFT methods to reduce the number of trained parameters, which can make the process feasible even on single-GPU machines for medium-sized models (using QLoRA).
