How to fine-tune OpenAI’s GPT-3.5 Turbo using Labelbox

In machine learning, fine-tuning pre-trained models is a powerful technique that adapts models to new tasks and datasets. Fine-tuning takes a model that has already learned representations on a large dataset, such as a large language model, and leverages prior knowledge to efficiently “teach” the model a new task.

The key benefit of fine-tuning is that it allows you to take advantage of transfer learning. Rather than training a model from scratch, which requires massive datasets and compute resources, you can start with an existing model and specialize it to your use case with far less data and compute. This makes fine-tuning essential for applying state-of-the-art models to real-world applications of AI.

OpenAI’s fine-tuning API allows teams to fine-tune the following models:

gpt-3.5-turbo-0613 (recommended)
babbage-002
davinci-002

In this guide, we’ll cover how to leverage OpenAI’s GPT-3.5 and Labelbox to simplify the fine-tuning process, allowing you to rapidly iterate and refine your models’ performance on specific data.

The goal of model fine-tuning is to improve the model’s performance on a specific task. Compared with other techniques for optimizing model output, such as prompt design, fine-tuning can help achieve:

Higher quality results: Fine-tuning allows the model to learn from a much larger and more diverse set of examples than can fit into a prompt, so it can pick up more granular patterns and semantics relevant to your use case. Prompts are limited in how much task-specific context they can provide, while fine-tuning teaches the model your specific domain.
Token savings: Fine-tuned models require less prompting to produce quality outputs. Because the model has already learned your domain, you can use a shorter, more general prompt, saving prompt engineering effort and tokens. Highly specific prompts, by contrast, can often hit token limits.
Lower latency: Long, heavily engineered prompts can increase latency because they require more processing. Because fine-tuned models are optimized for your specific task, they can produce good results from shorter prompts, which allows faster inference.

Fine-tuning is especially beneficial for adapting models to your specific use case and business needs. There are several common scenarios where fine-tuning can help models capture the nuances required for an application:

Style, tone, or format customization: Fine-tuning allows you to adapt models to match the specific style or tone required for a use case, whether that is a particular brand voice or a different tone for different audiences.
Desired output structure: Fine-tuning can teach models to follow a required structure or schema in outputs. For example, you can fine-tune a summarization model to consistently include key facts in a standardized template.
Handling edge cases: Real-world data often contains irregularities and edge cases. Fine-tuning allows models to learn from a wider array of examples, including rare cases. You can fine-tune the model on new data samples so that it learns to handle edge cases when deployed to production.

In short, fine-tuning allows ML teams to efficiently customize powerful general-purpose models to their specific use cases and business needs by training on curated data. High-quality fine-tuning datasets are crucial: they teach models the nuances and complexity of the target domain more thoroughly than is possible through prompts alone.

OpenAI’s recommended dataset guidelines

To fine-tune an OpenAI model, you must provide at least ten examples. OpenAI’s documentation reports clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo. Data quality, more than data quantity, is critical to the success of the fine-tuned model.
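As a rough illustration of these guidelines, the sketch below checks a dataset before it is submitted. It assumes the chat-format JSONL file that the fine-tuning API expects (one JSON object with a `messages` list per line). The function name `validate_examples`, the set of accepted roles, and the rule that each example needs an assistant message are our own illustrative choices, not part of any OpenAI or Labelbox library.

```python
import json

# Assumption: only the three roles used in this guide are expected.
VALID_ROLES = {"system", "user", "assistant"}
MIN_EXAMPLES = 10  # minimum number of training examples stated above


def validate_examples(jsonl_text):
    """Return a list of human-readable problems found in a JSONL dataset."""
    problems = []
    lines = [ln for ln in jsonl_text.splitlines() if ln.strip()]
    if len(lines) < MIN_EXAMPLES:
        problems.append(
            f"only {len(lines)} examples; at least {MIN_EXAMPLES} are required"
        )
    for i, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {i}: missing 'messages' list")
            continue
        for msg in messages:
            ok = (
                isinstance(msg, dict)
                and msg.get("role") in VALID_ROLES
                and isinstance(msg.get("content"), str)
            )
            if not ok:
                problems.append(f"line {i}: message needs a valid role and string content")
                break
        else:
            # Only reached when every message passed the checks above.
            if not any(m["role"] == "assistant" for m in messages):
                problems.append(f"line {i}: no assistant message to learn from")
    return problems
```

An empty list means the file passed these basic checks. It says nothing about the quality of the examples themselves, which still needs human review.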

You can learn more about preparing a dataset in OpenAI’s documentation.

How to use Labelbox for fine-tuning

Labelbox is a data-centric AI platform for building intelligent applications. With a suite of powerful data curation, labeling, and model evaluation tools, the platform is built to help continuously improve and iterate on model performance. For this example, we will use the Labelbox platform to create a high-quality fine-tuning dataset.

With Labelbox, you can prepare a dataset of prompts and responses to fine-tune large language models (LLMs). Labelbox supports dataset creation for a variety of fine-tuning tasks including summarization, classification, question-answering, and generation.

When you set up an LLM data generation project in Labelbox, you will be prompted to specify how you will be using the editor. You have three choices for specifying your LLM data generation workflow:

Workflow 1: Humans generate prompts and responses

In the editor, the prompt and response fields will be required. This will indicate to your team that they should create a prompt and response from scratch.

You can make a copy of this Google Colab Notebook to generate a prompt and response dataset in Labelbox and fine-tune GPT-3.5 Turbo in OpenAI.

Workflow 2: Humans generate prompts

In the editor, only the prompt field will be required. This will indicate to your team that they should create a prompt from scratch.

Workflow 3: Humans generate responses to uploaded prompts

In the editor, a previously uploaded prompt will appear. Your team will need to create responses for that prompt.

You can make a copy of this Google Colab Notebook to generate a dataset of responses to uploaded prompts in Labelbox and fine-tune GPT-3.5 Turbo in OpenAI.

In the example below, we’ll walk through a sample use case: summarizing customer support chats and removing PII with Labelbox and OpenAI’s GPT-3.5 Turbo. Imagine we’re a company that wants to summarize support logs without revealing personally identifiable information (PII) in the process. To do that, we’ll fine-tune an LLM to summarize customer support logs and remove PII from the summaries.

Step 1: Evaluate how GPT-3.5 performs against the desired task

Before we begin the fine-tuning process, let’s first evaluate how ChatGPT (using GPT-3.5) performs against the desired task off-the-shelf.

We uploaded the following sample chat log to ChatGPT:

“Summarize this chat log and remove any personally identifiable information in the summary:

Tom: I need to reset my account access.

Ursula: I can help with that, Tom. What’s your account email?

Tom: It’s tom@example.com

Ursula: Great, Tom. I’ve sent you a link to update your credentials”

In the above prompt, we’ve asked ChatGPT to summarize the chat log and remove any personally identifiable information.

Upon evaluation, the default GPT-3.5 model misses the mark for our desired use case.

The summary includes both Tom’s and Ursula’s names and explicitly mentions Tom’s email address. To use the model reliably for our business use case, we need to fine-tune it so that it excludes personally identifiable information. To do so, we will leverage Labelbox to generate our fine-tuning dataset and use it to fine-tune GPT-3.5 through OpenAI.
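A manual read of one summary is enough here, but a quick scripted check can help when evaluating many chat logs. The sketch below is a deliberately simple heuristic, not a complete PII detector: it assumes chat logs use the `Name: message` layout shown above, and the helper name `find_pii_leaks` and both regular expressions are our own illustrative choices.

```python
import re

# Heuristic patterns; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SPEAKER_RE = re.compile(r"^\s*([A-Z][a-z]+):", re.MULTILINE)


def find_pii_leaks(chat_log, summary):
    """Return speaker names and emails from chat_log that reappear in summary."""
    candidates = set(SPEAKER_RE.findall(chat_log)) | set(EMAIL_RE.findall(chat_log))
    leaked = []
    for item in candidates:
        # Case-insensitive whole-word match of the name or address.
        pattern = r"\b" + re.escape(item) + r"\b"
        if re.search(pattern, summary, re.IGNORECASE):
            leaked.append(item)
    return sorted(leaked)
```

For the chat log above, a summary that mentions Tom, Ursula, or tom@example.com would be flagged, while a summary that refers only to "the customer" and "the agent" would return an empty list.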

Step 2: Create an LLM data generation project in Labelbox

The first step will be to upload our support chat logs to Labelbox Catalog – this will allow us to browse, curate, and send these data rows for labeling.
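If the chat logs are uploaded programmatically rather than through the UI, the upload payload might be prepared as in the sketch below. It assumes each chat log is available as a plain string and that inline text is accepted as `row_data`; if it is not, a URL to a hosted text file would take its place. The helper `build_data_rows` and the `support-chat` key prefix are hypothetical names used for illustration.

```python
def build_data_rows(chat_logs, key_prefix="support-chat"):
    """Turn raw chat-log strings into data row payload dictionaries."""
    rows = []
    for i, text in enumerate(chat_logs):
        text = text.strip()
        if not text:
            continue  # skip empty logs rather than uploading blank rows
        rows.append({
            "row_data": text,                        # assumption: inline text
            "global_key": f"{key_prefix}-{i:05d}",   # unique, stable identifier
        })
    return rows


# Possible usage with the Labelbox Python SDK (needs an API key, not run here):
#   import labelbox as lb
#   client = lb.Client(api_key="<YOUR_API_KEY>")
#   dataset = client.create_dataset(name="support-chat-logs")
#   dataset.create_data_rows(build_data_rows(chat_logs))
```

Stable global keys make it easier to match labeled rows back to the original chat logs after export.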

Next, we’ll need to create an LLM data generation labeling project in Labelbox Annotate.

Since we have an available dataset, this will be a ‘Humans generate responses to uploaded prompts’ project.
When configuring the ontology, we will set the response type to ‘text’ and describe the expected response as “summarize and remove personally identifiable information in the summary”.
During ontology creation, you can also define a character minimum or maximum and upload any necessary instructions for the labeling team.

Step 3: Label data

After successfully setting up an LLM data generation project, we can queue the uploaded chat logs in Catalog for labeling in Annotate. To label data, you can either leverage Labelbox Boost’s extensive workforce or use your own internal team to write summaries with personally identifiable information removed.

For larger or more complex fine-tuning tasks, you can scale up to hundreds or thousands of labeled data rows. Once all data has been labeled, you can review the summary for each prompt and export the data rows.

Step 4: Export data from Labelbox and fine-tune GPT-3.5 in OpenAI

With all necessary data labeled, we can export the dataset from Labelbox and convert it to a format that is readable by OpenAI.

OpenAI requires the dataset to follow the structure of its Chat Completions API, where each message has a role, content, and an optional name. You can learn more about specific dataset requirements in OpenAI’s documentation. Using a script, we can convert the Labelbox export into OpenAI’s required conversational chat format.

{
  "messages": [
    {"role": "system",
     "content": "Given a chat log, summarize and remove personally identifiable information in the summary."},
    {"role": "user",
     "content": "Andy: Why has my order not shipped yet?! Bella: I apologize for the delay, Andy. May I have your order number? Andy: It's ORDER5678. Please hurry! Bella: Thank you Andy. It's expedited and will ship today."},
    {"role": "assistant",
     "content": "Customer inquires about the delay in the shipment of his order. Support agent requests the order number and upon receiving it, assures customer that the order has been expedited and will ship that day."}
  ]
}
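The conversion script itself could look something like the sketch below. It assumes the Labelbox export has already been reduced to (prompt, response) pairs, since the export’s field names are not shown in this guide. The helper names `to_chat_example` and `write_jsonl` are hypothetical. The output is a JSONL file with one training example per line.

```python
import json

SYSTEM_PROMPT = (
    "Given a chat log, summarize and remove personally identifiable "
    "information in the summary."
)


def to_chat_example(prompt, response, system_prompt=SYSTEM_PROMPT):
    """Wrap one prompt/response pair in the chat message structure."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},          # the original chat log
            {"role": "assistant", "content": response},   # the labeled summary
        ]
    }


def write_jsonl(pairs, path):
    """Write (prompt, response) pairs as one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in pairs:
            example = to_chat_example(prompt, response)
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the same system prompt in every example, and reusing it at inference time, keeps the training and production prompts consistent.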

After formatting our dataset, we can upload it and start a fine-tuning job using the OpenAI SDK.
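A minimal sketch of that upload-and-start step is shown below, assuming version 1.x of the `openai` Python package and a JSONL training file on disk. The wrapper `start_fine_tune` is our own name. The client is passed in as an argument so that the function can be exercised without network access.

```python
def start_fine_tune(client, jsonl_path, base_model="gpt-3.5-turbo"):
    """Upload a JSONL training file and create a fine-tuning job; return the job ID."""
    # Upload the training data with the purpose the fine-tuning API expects.
    with open(jsonl_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")
    # Start the job against the uploaded file.
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model=base_model,
    )
    return job.id


# Possible usage (requires the `openai` package and OPENAI_API_KEY, not run here):
#   from openai import OpenAI
#   client = OpenAI()
#   job_id = start_fine_tune(client, "train.jsonl")
#   status = client.fine_tuning.jobs.retrieve(job_id).status
```

The job runs asynchronously, so the returned ID is what you would use to check on its status later.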

You can use a copy of the following Google Colab notebook to export data from Labelbox in a format compatible with fine-tuning GPT-3.5 Turbo and begin a fine-tuning job.

Step 5: Assess the fine-tuned model’s performance in OpenAI

After the fine-tuning job has succeeded, you can navigate to the OpenAI playground and select the newly fine-tuned model for evaluation. As we did when evaluating the initial GPT-3.5 model, we can enter a sample chat log and see how the newly fine-tuned model performs.
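The same check could also be scripted instead of run in the playground. The sketch below assumes version 1.x of the `openai` Python package. The function name `summarize_without_pii` is ours, and `model_id` stands for whatever fine-tuned model name your own job reports. The client is again passed in as an argument.

```python
SYSTEM_PROMPT = (
    "Given a chat log, summarize and remove personally identifiable "
    "information in the summary."
)


def summarize_without_pii(client, model_id, chat_log):
    """Ask a (fine-tuned) chat model for a PII-free summary of chat_log."""
    response = client.chat.completions.create(
        model=model_id,
        messages=[
            # Same system prompt as in the training examples.
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chat_log},
        ],
        temperature=0,  # keep output as stable as possible for comparisons
    )
    return response.choices[0].message.content


# Possible usage (requires the `openai` package and OPENAI_API_KEY, not run here):
#   from openai import OpenAI
#   summary = summarize_without_pii(OpenAI(), "<your-fine-tuned-model>", chat_log)
```

Running the same chat logs through both the base model and the fine-tuned model gives a side-by-side comparison on identical inputs.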

Compared to the off-the-shelf GPT-3.5 model, the model fine-tuned on our training data performs as expected. We can see that all names and other details that would be considered personally identifiable information have been redacted.

We can also compare the fine-tuned model to the initial GPT-3.5 model and see how it performs on the same prompt. Again, we can see that while GPT-3.5 excludes some aspects of personally identifiable information, it still includes the user’s first name, so it doesn’t quite meet the expectations for our business use case.

The newly fine-tuned model has allowed us to adapt GPT-3.5 to our specific use case of concealing personally identifiable information. With Labelbox, teams can iteratively identify gaps and outdated samples in the fine-tuning data, then generate fresh high-quality data, allowing model accuracy to be maintained over time. Updating fine-tuning datasets through this circular feedback process is crucial for adapting to new concepts and keeping models performing at a high level within continuously changing environments.

To improve LLM performance, Labelbox simplifies the process for subject matter experts to generate high-quality datasets for fine-tuning with leading model providers and tools, like OpenAI.

Unlock the full potential of large language models with Labelbox’s end-to-end platform and a new suite of LLM tools to generate high-quality training data and optimize LLMs for your most valuable AI use cases.