How to Implement Reinforcement Learning from Human Feedback (RLHF)

The Artificial Intelligence (AI) revolution has been driven by the development of systems and solutions that align with human values and preferences. Reinforcement Learning from Human Feedback (RLHF) is one technique that has transformed model training and improved the accuracy and applicability of AI applications. 

Implementing RLHF presents a promising avenue for enhancing AI systems with human guidance. RLHF has been used to develop impressive, human-like conversational bots, such as OpenAI’s ChatGPT. While this training technique is still evolving, its application is widespread, and it has become a cornerstone of large language model (LLM) development.

RLHF is an extension of Reinforcement Learning (RL), a reward-and-punishment-based training technique for AI models. It differs from other RL techniques in its introduction of human feedback to ensure that the resulting models behave in safe, ethical, and desirable ways. Instead of relying on predefined rewards, RLHF allows human users to interactively provide feedback to the model in the form of corrections, ratings, and preferences. This feedback is used to train a reward model, which is then used to fine-tune the target model with a reinforcement learning algorithm. 
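The loop described above can be sketched end to end. The snippet below is a deliberately simplified illustration, not a real implementation: the reward model and policy update are stand-in Python functions rather than neural networks, and the prompts and answers are invented. It only shows how preference feedback becomes a reward signal that steers the policy.

```python
# Toy sketch of the RLHF data flow (illustrative only; real systems use
# neural networks in place of these stand-in functions).

# 1. Human feedback: annotators compare two model answers per prompt.
feedback = [
    {"prompt": "Explain RLHF", "chosen": "clear answer", "rejected": "rambling answer"},
    {"prompt": "Summarize", "chosen": "short summary", "rejected": "off-topic text"},
]

# 2. A reward model learned from that feedback scores any (prompt, answer) pair.
def reward_model(prompt, answer):
    # Stand-in heuristic: prefer answers marked "chosen" in the feedback.
    chosen = {f["chosen"] for f in feedback}
    return 1.0 if answer in chosen else -1.0

# 3. The RL loop fine-tunes the policy toward higher-reward answers.
def rl_step(prompt, candidates):
    # Greedily pick the highest-reward candidate (a stand-in for a
    # policy-gradient update such as PPO).
    return max(candidates, key=lambda a: reward_model(prompt, a))

best = rl_step("Explain RLHF", ["rambling answer", "clear answer"])
# best == "clear answer"
```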

Human-in-the-Loop as RLHF Backbone 

Human feedback is fundamental to RLHF and distinguishes it from other RL techniques. Since most LLMs are trained on a large corpus spanning diverse contexts and domains, their applicability to individual users is limited. To make these models helpful, harmless, and context-specific, an approach called human-in-the-loop (HITL) is applied. HITL introduces human evaluators into the model-training lifecycle to show the system how to generate human-preferred content. It is considered the backbone of RLHF because it creates a continuous feedback loop in which human input is integrated into the AI system to improve its usability in various contexts. The goal of human feedback is to build a reward signal for fine-tuning LLMs beyond sub-optimal generalized performance; this feedback data aligns LLMs with complex human values and preferences.  

Although RLHF is a powerful model training technique, it can be slow and costly to implement and maintain because of its reliance on human feedback. Without the right tools and systems in place, collecting and annotating human feedback for pre-training and fine-tuning datasets can be an expensive and time-consuming proposition. 

Let’s dive into the current approach being used by AI leaders developing new LLMs, leveraging RLHF to incrementally fine-tune base LLMs using human-provided reward signals and automated inputs.

The RLHF Process: A Step-by-Step Guide

While RLHF is a complex concept coupled with multiple model-training processes and tools, its implementation can be broken down into four straightforward steps:

Step 1: Pre-training a Language Model 

Pre-training a language model is the foundation of the RLHF process. It involves either training a base model end to end or simply selecting an existing pre-trained language model to begin with. Depending on the approach taken, pre-training is the most tedious, time-consuming, and resource-intensive phase of RLHF. Since training a language model from scratch complicates the RLHF process even further, choosing one of the many pre-trained models is recommended.

Simply put, RLHF can be seen as a way of unlocking the capabilities of existing pre-trained models. For example, a conversational AI chatbot can be developed from a mid-sized model like LLaMA, whose largest variant has 65 billion parameters, or the far larger GPT-4, reported (though not officially confirmed) to have around 1.76 trillion parameters. Selecting or pre-training the right base model depends on the available resources and the task at hand, as there is no universally best model to kickstart RLHF training.

Pretrained LM as a starting point (image from Hugging Face)

Step 2: Supervised Fine-tuning

The base model pre-trained or selected in Step 1 contains the knowledge needed to answer user queries, but it lacks the context and conditioning to generate responses in the formats users expect. Therefore, before reinforcement learning, supervised fine-tuning (SFT) is applied to the pre-trained model. The goal of SFT is to prime the model to respond appropriately to different user prompts: human annotators demonstrate the desired patterns by writing example prompts and responses, and the model is trained on these demonstrations with supervised learning. This is a significant starting point for RLHF implementation, since the SFT phase optimizes the base model’s parameters on context clues for the target criteria. For that reason, having a model that responds well to diverse instructions is foundational to the RLHF process. 

The goal of the SFT phase of the RLHF process is to prime the base model to understand user goals, language patterns, and contexts. It exposes the model to diverse linguistic patterns that enable it to generate coherent and contextually appropriate text. The human trainer guides the base model through numerous examples of human-preferred outputs. Throughout this process, the model learns various relationships between words and concepts and their appropriate usage. This text-based machine learning approach is a building block of Natural Language Processing (NLP). However, at this point, the model still lacks the human touch and preferences. Additional data is needed to bring this human-like feel to the model. This is where human feedback comes in. 
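As a rough illustration of what SFT optimizes, the toy sketch below nudges a single set of logits toward an annotator-written demonstration by gradient descent on the cross-entropy loss. The two candidate responses and the learning rate are invented for illustration; real SFT applies the same objective over billions of parameters and full token sequences.

```python
import math

# Minimal SFT sketch: a toy "model" holds one logit per candidate
# response, and supervised fine-tuning raises the probability of the
# human-written demonstration via cross-entropy gradient descent.

responses = ["helpful answer", "unhelpful answer"]
logits = {r: 0.0 for r in responses}          # untrained: uniform
demonstration = "helpful answer"              # human-preferred target

def softmax(logit_map):
    z = max(logit_map.values())
    exps = {r: math.exp(v - z) for r, v in logit_map.items()}
    total = sum(exps.values())
    return {r: e / total for r, e in exps.items()}

# Gradient of the cross-entropy loss -log p(demonstration) w.r.t. each
# logit is p(r) - 1[r == demonstration].
lr = 0.5
for _ in range(20):
    probs = softmax(logits)
    for r in responses:
        grad = probs[r] - (1.0 if r == demonstration else 0.0)
        logits[r] -= lr * grad

probs = softmax(logits)  # the demonstration now dominates
```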

In the next phase of RLHF implementation, a reward model is developed from the pre-trained language model blueprint to integrate human preferences into the system. 

Step 3: Training a Reward Model Using Human Feedback

A reward model (RM) is key to implementing RLHF. In an ideal scenario, we could simply fine-tune the base model with human feedback until we achieved a domain-specific model. However, this approach would require large volumes of training samples fed directly into the base model by human annotators, making it slow, expensive, and counterproductive. The best way to overcome these shortcomings is to train a reward model and introduce it into the RL loop.

A reward model maps input text to a scalar reward value the way a human would. It is an alignment tool that evaluates the base model’s output and returns a reward signal, which the main LLM then uses to optimize its parameters. 

Human annotators do the heavy lifting in this phase of RLHF implementation. They generate the training dataset (prompt-answer pairs) and rank the answers according to their preferences before feeding them to the model. The reward model then has to align its ‘rewarding’ system with the patterns in these samples. However, this process is subjective, since the preferences reinforced by annotators can be biased. As such, diversity is needed when creating prompt and reward pairs.

In practice, the most straightforward way to build a reward model is annotation prediction: the model is trained to produce a rating score and determine which of two outputs aligns more closely with human preferences, rewarding the more appropriate one. The reward and prompt pairs are then used to train the reward model to associate specific outputs with reward values. 
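A common way to formalize this annotation prediction is a pairwise ranking loss of the form -log sigmoid(r(chosen) - r(rejected)). The sketch below fits a one-parameter reward model on hypothetical ranked pairs; the single numeric "feature" per answer is an invented stand-in for a learned neural representation.

```python
import math

# Reward-model training sketch: each answer is reduced to one hand-made
# feature, and a 1-parameter reward r(x) = w * x is fit with the pairwise
# ranking loss -log sigmoid(r(chosen) - r(rejected)).

pairs = [  # (feature of chosen answer, feature of rejected answer)
    (0.9, 0.2),
    (0.8, 0.1),
    (0.7, 0.4),
]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w = 0.0   # untrained reward weight
lr = 1.0
for _ in range(100):
    for x_chosen, x_rejected in pairs:
        margin = w * x_chosen - w * x_rejected
        # Gradient of -log sigmoid(margin) w.r.t. w is
        # -(1 - sigmoid(margin)) * (x_chosen - x_rejected).
        w += lr * (1.0 - sigmoid(margin)) * (x_chosen - x_rejected)

def reward(x):
    return w * x  # chosen-style answers now score higher
```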

Human feedback is then used to refine the reward model as it matures. For instance, users can rate the AI output with a thumbs-up or thumbs-down. This feedback data gives the reward model insight into human preferences. It can then automatically rank the RL agent’s output without human intervention while iteratively learning from such feedback to better imitate human judgment. 

Reward model training lifecycle (image from TechTalks)

Step 4: Fine-Tuning the RL Policy with the Reward Model 

Fine-tuning is one of the ways to unlock the potential of LLMs. It involves adapting the base model to specific tasks and more specialized domains. This is the last phase of RLHF implementation: a feedback loop is created to train and fine-tune the RL policy (a copy of the original LLM) with the reward model trained in Step 3 above. 

The RL policy generates a response and sends it to the reward model for evaluation. The reward model scores the output and returns a reward signal, which the RL policy uses to adjust its behavior. Through the RM’s reward scores, the RL policy learns to generate responses that humans are likely to prefer.

A policy-gradient RL algorithm called Proximal Policy Optimization (PPO) and the Kullback-Leibler (KL) divergence are the basis of this RLHF phase. The RL policy is optimized using PPO, which balances exploitation and exploration during training. At this point, some base LLM parameters are frozen because fine-tuning all of, say, 65 billion parameters would be impractically slow and expensive. 

PPO fine-tuning improves training stability by limiting the changes made to the policy at each training epoch. However, given the chance, the PPO algorithm might exploit the imperfections of the reward model and generate nonsensical output that nonetheless scores well. To counter such exploits, a KL penalty is introduced. 
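PPO's limit on per-epoch policy change comes from clipping the probability ratio between the new and old policies. The sketch below shows the standard per-sample clipped surrogate objective; the ratio and advantage values are illustrative, not from any real training run.

```python
# PPO's clipped surrogate objective for one sample:
# min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
# which caps how much credit a large policy change can earn per update.

def ppo_clip_objective(ratio, advantage, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large ratio gets no extra credit: the objective is capped at 1.2 * A.
capped = ppo_clip_objective(ratio=3.0, advantage=1.0)  # -> 1.2
modest = ppo_clip_objective(ratio=1.1, advantage=1.0)  # -> 1.1 (unclipped)
```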

In RLHF, the Kullback-Leibler (KL) divergence measures the difference between a reference distribution (typically the frozen base model’s outputs) and the RL policy’s current responses. Simply put, it penalizes the RL policy for veering substantially away from the base model with each training batch.
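In practice this is often implemented by subtracting a scaled per-token KL term from the reward model's score. A minimal sketch, with made-up distributions and a made-up penalty coefficient:

```python
import math

# KL-penalized reward sketch: the KL divergence between the RL policy and
# the frozen base model is subtracted from the reward-model score, so the
# policy is penalized for drifting far from the base model.

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base_model_probs = [0.5, 0.3, 0.2]  # frozen reference distribution
policy_probs = [0.6, 0.3, 0.1]      # current RL policy distribution

beta = 0.1      # KL penalty coefficient (illustrative)
rm_score = 2.0  # scalar score from the reward model (illustrative)
total_reward = rm_score - beta * kl_divergence(policy_probs, base_model_probs)
```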

RL policy fine-tuning using the reward model and PPO (image from a GitHub Gist)

Fine-tuning with the reward model discourages inappropriate responses by punishing them with low rewards. Since such low-reward outputs are unlikely to be repeated, the language model iteratively learns to produce outputs that align closely with human expectations; this is the essence of RLHF.

RLHF Use Cases

RLHF has proven useful in developing models used in healthcare, technology, banking, and finance, among other fields, from pre-trained models like GPT-4 and LLaMA. OpenAI’s InstructGPT, Anthropic’s Claude, and Google’s Gemini are some of the successful applications of RLHF.  

  • OpenAI’s InstructGPT – In NLP, InstructGPT is undoubtedly the most successful use case for RLHF. OpenAI used prompts submitted by its customers to the API as the basis for human feedback: the prompts were annotated and added to the training set to fine-tune GPT-3. The resulting InstructGPT model followed instructions better than the base GPT-3 model. Despite having only 1.3 billion parameters compared to the base model’s 175 billion, InstructGPT performs better, thanks to RLHF. 
  • Anthropic’s Claude – RLHF has also been applied in training Claude, Anthropic’s next-gen AI, to be more helpful. The AI assistant depends on human feedback and AI Constitution to align its responses with human values and preferences.    
  • Google’s Gemini – Google DeepMind introduced Gemini Ultra, a model enhanced through RLHF. This powerful model outperformed GPT-4 on several benchmarks, including reasoning and math, and was primed to generate helpful and harmless responses using RLHF.

Challenges of RLHF 

While RLHF has emerged as a groundbreaking AI model training technique, its implementation is not always straightforward. Some of the limitations of RLHF are:

  • Shortcomings of the human agents – Introducing human agents into the training cycle raises issues of reliability, scalability, and bias (divergence from the expected outcome). Ineffective human feedback may lead to suboptimal performance and introduce biases, skewing learning. 
  • Time and resource complexities – Scaling the models to handle more complex tasks could also be time-consuming and resource-intensive with the introduction of human agents in the training cycle.

However, the benefits outweigh the setup and maintenance costs of implementing RLHF. The challenges can be mitigated by balancing feedback, diversifying the perspectives of human annotators, and periodically evaluating model performance. Another method used by companies like Anthropic and Google to mitigate the cost and time of gathering the necessary feedback is RLAIF, or reinforcement learning from AI feedback. By using another LLM to augment or replace the volume of human feedback needed during the feedback-collection stage, teams can move faster and even choose to focus human effort on more expert-level, domain-specific evaluation tasks.

Final Thoughts on Reinforcement Learning from Human Feedback (RLHF)

As AI advances, RLHF ensures that LLMs’ capabilities are aligned with complex human preferences, goals, and environments. RLHF has revolutionized the subfield of NLP, specifically downstream LLM applications. 

It has pioneered the humanization of AI solutions by incorporating feedback from users and the preferences of human annotators. We have seen that implementing RLHF is a four-step process that starts with a pre-trained model and ends with fine-tuning a copy of that model against a reward model trained on human feedback. 

Introducing humans into the training loop is the cornerstone of RLHF, although balance is needed, as even slight inefficiencies can introduce bias and skew the learning process. It is important to note that RLHF performance is only as good as the quality of the human annotators and the human-generated text used for fine-tuning. 

Labelbox is a complete solution combining the best tools and fully managed services for reinforcement learning from human feedback (RLHF) and LLM evaluation. We ensure helpful, trustworthy, safe outputs with highly accurate datasets for instruction tuning, RLHF, and supervised fine-tuning. Get started with a free trial of the platform and see how Labelbox helps you ship better LLMs.