Zero-Shot Learning vs. Few-Shot Learning vs. Fine-Tuning: A technical walkthrough using OpenAI's APIs & models

With large language models (LLMs) gaining popularity, new techniques have emerged for applying them to NLP tasks. Three techniques in particular — zero-shot learning, few-shot learning, and fine-tuning — take different approaches to leveraging LLMs. In this guide, we'll walk through the key differences between these techniques and how to implement them.

We’ll walk through a case study of extracting airline names from tweets to compare the techniques. Using an entity extraction dataset, we’ll benchmark performance starting with zero-shot prompts, then experiment with few-shot learning, and finally fine-tune a model. By analyzing the results, we can better understand when to use zero-shot, few-shot, or fine-tuning with LLMs. You’ll also pick up tips for constructing effective prompts and setting up LLM experiments.

The goal of this guide is to:

  • Compare the quantitative results among zero-shot learning, few-shot learning, and fine-tuning on an NER use case
  • Explore how to use each of these learning techniques with Labelbox’s LLM Editor & Labelbox Model

Zero-shot learning, few-shot learning, and fine-tuning in action

For the purposes of this case study, we will walk through an example of entity extraction featured in this Google Colab Notebook. Specifically, we have a dataset of tweets about major airlines, and the task is to use an LLM API to extract all airline names that appear in the tweets.

The full dataset can be found on Kaggle here.

Here is an example data row:

"@AmericanAir do you have the phone number of a supervisor i can speak to regarding my travels today,['American Airlines’]

where the:

TWEET: @AmericanAir do you have the phone number of a supervisor i can speak to regarding my travels today

LABEL: [‘American Airlines']"

Before delving into the details, a few key technical definitions to keep in mind:

  • Zero-shot learning — a technique whereby we prompt an LLM without any examples, attempting to take advantage of the reasoning patterns it has learned during pre-training (i.e. using it as a generalist LLM)
  • Few-shot learning — a technique whereby we prompt an LLM with several concrete examples of the task being performed
  • Fine-tuning — a technique whereby we take an off-the-shelf open-source or proprietary model, further train it on a set of concrete task examples, and save the updated weights as a new model checkpoint

Establishing a benchmark baseline with zero-shot learning

As with any scientific or machine learning experiment, it is important to establish a benchmark baseline. For this case study, we will use zero-shot learning as the baseline by experimenting with various zero-shot prompts and evaluating the performance (precision, recall, f1-score, accuracy) of these prompts against the test set.
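To make the evaluation loop concrete, here is a minimal sketch of how a zero-shot prompt can be scored against the test set. The helper names, prompt template, and exact-match scoring rule are illustrative assumptions rather than the notebook's actual code, and the gpt-3.5-turbo call itself (made via the OpenAI client) is omitted.

```python
# Sketch: building a zero-shot prompt and scoring predictions against labels.
# Helper names and the exact-match rule are illustrative assumptions; the
# actual gpt-3.5-turbo call (via the OpenAI client) is left out.

def build_zero_shot_messages(prompt_template: str, tweet: str) -> list:
    """Fill the tweet into a zero-shot prompt template (no examples given)."""
    return [{"role": "user", "content": prompt_template.format(tweet=tweet)}]

def exact_match_accuracy(predictions: list, labels: list) -> float:
    """Fraction of rows where the predicted airline set equals ground truth."""
    if not labels:
        return 0.0
    hits = sum(set(p) == set(t) for p, t in zip(predictions, labels))
    return hits / len(labels)
```

Each zero-shot prompt template is run over the test tweets, the model's bracketed output is parsed into a list of airline names, and per-prompt accuracy (alongside precision, recall, and F1) is recorded.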

The demo video below shows how we can leverage the ‘Humans Generate Prompt’ option of the LLM Editor within Labelbox Annotate to create various zero-shot prompts using in-house and/or external teams to scale out our annotation operations.

Once we have constructed our dataset of zero-shot prompts within Labelbox using a prompt engineering workforce, we can export them from the Labelbox UI or via the Labelbox Python SDK and use them in our script, as shown in the video below.

Using gpt-3.5-turbo, we see the results of the prompts, along with their performance metrics, attached below:

From these results, we observe that the prompts that fared best provided clear and structured instructions to the model. These prompts explicitly mention the expected format for identifying airline names, and their structure introduces a pattern the model can recognize and follow. In the case of "Detect airline references...", the model is prompted to look for references in a specific format (e.g. hashtags), which may be a common pattern in tweets mentioning airlines.

The prompts that fared worst shared a theme: ambiguity in the expected response format. Examples like "What are the airlines in this tweet?" and "Find all airline mentions in the tweet" are open-ended and do not provide a specific structure, making it harder for the model to interpret the task.

What's interesting to note is that the prompt "Identify airlines like this - [#AIRLINE_NAME_1]:'{tweet}'" did NOT perform well even though it provided an example response format. This underscores the profound impact that punctuation and grammatical structure can have on prompt engineering. Instead of interpreting "[#AIRLINE_NAME_1]" as the desired output format, the LLM interpreted it as a pattern-matching task: identify all airlines within a tweet that match the literal format [#AIRLINE_NAME_1], of which there are none (hence, 0% across the evaluation metrics).

In addition to testing different prompts, we can run various experiments to evaluate the efficacy of the task and evaluate the impact of those experiments in Labelbox Model. One such experiment could be to compare different models (GPT-4 vs. GPT-3.5, etc.). The video below shows how we can compare GPT-4 vs. GPT-3.5 on how each model performs on extracting airlines from the prompts we created above.

Zero-shot learning netted us the following accuracy baseline on the test set: 19%.

Few-shot learning 

To build upon this benchmark, we used one of the prompts that performed well in zero-shot learning in tandem with few-shot learning — a technique whereby along with the prompt, we also feed an LLM concrete examples of task performance. These concrete examples are chosen from our training dataset (found in airline_train.csv).

This is an example few-shot prompt that we passed to the LLM:

Given the following tweets and their corresponding airlines, separated by new lines:

1) SouthwestAir bags fly free..just not to where you're going.,['Southwest Airlines']

2) Jet Blue I don't know- no one would tell me where they were coming from,['JetBlue Airways']

Please extract the airline(s) from the following tweet:

"SouthwestAir Just got companion pass and trying to add companion flg. Help!"

Using the following format - ['#AIRLINE_NAME_1] for one airline or ['#AIRLINE_NAME_1, #AIRLINE_NAME_2...] for multiple airlines.
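A prompt like the one above can be assembled programmatically from (tweet, label) pairs drawn from airline_train.csv. The sketch below mirrors the format shown; the function name and exact wiring are assumptions for illustration, not the notebook's code.

```python
def build_few_shot_prompt(examples, tweet):
    """Assemble a few-shot prompt from (tweet, airlines) training pairs,
    mirroring the 'N) tweet,[labels]' layout shown above."""
    lines = ["Given the following tweets and their corresponding airlines, "
             "separated by new lines:", ""]
    for i, (ex_tweet, airlines) in enumerate(examples, start=1):
        lines.append(f"{i}) {ex_tweet},{airlines}")
        lines.append("")
    lines += [
        "Please extract the airline(s) from the following tweet:",
        "",
        f'"{tweet}"',
        "",
        "Using the following format - ['#AIRLINE_NAME_1] for one airline or "
        "['#AIRLINE_NAME_1, #AIRLINE_NAME_2...] for multiple airlines.",
    ]
    return "\n".join(lines)
```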

Evaluating our model (gpt-3.5-turbo) on the test set via few-shot learning, we achieved an accuracy of 96.66%! There were 7 total misclassifications. Closer inspection of these misclassifications reveals that there may be issues within the ground-truth dataset itself.

Few-shot learning netted us the following accuracy benchmark on the test set: 97%.

Fine-tuning with a training dataset

Lastly, we seek to determine whether fine-tuning would improve our results. To ensure parity across our experiments, we used the same 100 randomly sampled examples from the training dataset that we used for few-shot learning as the training data for the fine-tuning task.
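For gpt-3.5-turbo, OpenAI's fine-tuning endpoint expects training data as a JSONL file of chat-formatted examples. Below is a minimal sketch of converting our (tweet, label) rows into that format; the instruction wording and file name are assumptions for illustration.

```python
import json

def to_chat_record(tweet, airlines):
    """One training example in the chat JSONL format used by the
    OpenAI fine-tuning endpoint for gpt-3.5-turbo."""
    return {
        "messages": [
            {"role": "user",
             "content": f'Please extract the airline(s) from the following tweet: "{tweet}"'},
            {"role": "assistant", "content": str(airlines)},
        ]
    }

def write_training_file(rows, path="airline_finetune.jsonl"):
    """Serialize (tweet, airlines) rows to a JSONL file for upload."""
    with open(path, "w") as f:
        for tweet, airlines in rows:
            f.write(json.dumps(to_chat_record(tweet, airlines)) + "\n")
```

The resulting file is uploaded via the Files API and passed to a fine-tuning job; the fine-tuned checkpoint is then evaluated on the same test set as the other two techniques.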

Fine-tuning netted us the following accuracy benchmark on the test set: 91%.

Comparing results

These were the final results:

1) Zero-shot learning netted us the following accuracy baseline benchmark on the test set: 19%.

2) Few-shot learning netted us the following accuracy benchmark on the test set: 97%.

3) Fine-tuning netted us the following accuracy benchmark on the test set: 91%.

Key takeaways

Prompt engineering in tandem with few-shot learning and fine-tuning yield similar results for the task of extracting airline references from tweets. The ultimate consideration between the two boils down to economies of scale. If a team needs to execute a prompt 100,000 times, the investment in human hours and GPU usage (or token costs, if using an API) can justify fine-tuning, since the cumulative savings in prompt tokens and the potential for improved output quality add up significantly. Conversely, if we're only running the prompt ten times and it's already effective, there's no rationale for fine-tuning.
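The economies-of-scale argument can be made concrete with back-of-the-envelope arithmetic. All numbers below (call volume, token counts, per-token rates, one-off training cost) are hypothetical placeholders, not actual OpenAI pricing:

```python
def prompt_cost(calls, tokens_per_call, rate_per_1k_tokens):
    """Total cost of running a prompt `calls` times."""
    return calls * tokens_per_call / 1000 * rate_per_1k_tokens

# Few-shot: a long prompt (100 in-context examples) at a base-model rate.
few_shot_total = prompt_cost(calls=100_000, tokens_per_call=4_000,
                             rate_per_1k_tokens=0.0015)

# Fine-tuned: a short prompt (no examples) at a higher per-token rate,
# plus a one-off training cost.
fine_tuned_total = (prompt_cost(calls=100_000, tokens_per_call=100,
                                rate_per_1k_tokens=0.003) + 50.0)
```

Under assumptions like these, the token savings from dropping in-context examples dominate at high call volumes, while at ten calls the one-off training cost never pays for itself.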

Regardless of which option you choose, we’ve seen how Labelbox Annotate and Labelbox Model can help achieve both outcomes.

We can also improve results by ensuring variability within the few-shot learning examples. Currently, we use a naive approach: selecting 100 randomly sampled examples from the training dataset so that the prompt fits within the context window of gpt-3.5-turbo. Although this leverages only ~10% of our training data (100 of 900 possible training rows), it's less about data quantity and more about data quality. If we can curate a few-shot learning dataset of only 100 rows with enough variance to be representative of the full 900, that would be ideal from a token-sizing (and therefore cost) perspective as well as an engineering perspective (achieving more with less).

Examples of variance include tweet structure as well as a healthy mix of tweets that reference multiple airlines. For tweet structure, we can use regular expressions to create patterns that capture mentions (@username), hashtags (#hashtag), airline stock ticker symbols, or emojis. In our use case, hashtags and ticker symbols would be especially beneficial, since we see them scattered throughout our training dataset (e.g. #LUVisbetter, #jetblue, #UnitedAirlines, AA, etc.).
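As a sketch of what such patterns might look like (the exact regexes are illustrative, and the cashtag-style ticker pattern is an assumption — bare tickers like "AA" would need a stricter rule to avoid false positives):

```python
import re

MENTION = re.compile(r"@\w+")           # e.g. @AmericanAir
HASHTAG = re.compile(r"#\w+")           # e.g. #jetblue, #LUVisbetter
TICKER = re.compile(r"\$[A-Z]{1,5}\b")  # cashtag-style, e.g. $LUV (assumed form)

def tweet_structure(tweet):
    """Surface-structure features used to stratify few-shot example selection."""
    return {
        "mentions": MENTION.findall(tweet),
        "hashtags": HASHTAG.findall(tweet),
        "tickers": TICKER.findall(tweet),
    }
```

Grouping training rows by these features lets us sample a 100-row few-shot set that covers each structural pattern instead of relying on a uniform random draw.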

Using Labelbox Model, we can version our few-shot learning and fine-tuning experiments by tying each experiment to a corresponding model run. Comparing evaluation metrics across different model runs allows us to either:

  • Choose the few-shot learning prompt with the best precision, accuracy, and/or recall metric
  • Choose the fine-tuned model with the best precision, accuracy, and/or recall metric

Through our case study extracting airline names from tweets, we've explored the key differences between zero-shot learning, few-shot learning, and fine-tuning for applying large language models to NLP tasks.

The best technique depends on your use case and available resources. The key is understanding how different data inputs affect an LLM's behavior. With the right approach, you can take advantage of LLMs' reasoning skills for a wide range of NLP applications. You can leverage each of these learning techniques with Labelbox’s LLM data generation editor and Labelbox Model. 

Labelbox is a data-centric AI platform that empowers teams to iteratively build powerful product recommendation engines to fuel lasting customer relationships. To get started, sign up for a free Labelbox account or request a demo.