LabelboxJune 29, 2023

How to confidently compare, test, and evaluate models for machine learning

This is the fourth post in a blog series showcasing how Model Foundry brings the power of foundation models into Labelbox. Learn how to leverage Labelbox for model comparison and A/B testing.


For many AI teams, the rise of off-the-shelf and foundation models has transformed the way they build and train models. Some can simply choose an existing model and fine-tune it for their specific requirements, making the AI development process much faster. Even if that's not possible for your use case, using foundation and off-the-shelf models can accelerate your process if you use them to pre-label data for your use case.

Choosing the best algorithm for the task, however, is key to ensuring the best results — starting with the wrong model can cause delays and less optimal model performance. That's why leading teams are adding a comprehensive model comparison process to their AI development workflow by evaluating different models based on performance metrics, how well the model performs on unseen data, and how well suited a model is to their ultimate business goals. Beyond measuring model performance, model comparison is useful in helping teams tracking ML experiments and creating a “store of record” for future reference and to help improve model performance.

While model comparison is a critical part of the ML workflow to ensure you’re using the best and most efficient model for your use case, ML teams encounter some common challenges during this process, such as:

  • Confidently assessing the potential and limitations of pre-trained models
  • Visualizing models’ performance for comparison
  • Effectively sharing experiment results

In this blog post, we’ll explore how you can tackle these challenges with Model Foundry, a soon-to-be beta released solution from Labelbox that enables ML teams to better leverage foundation models for AI development.

Why is model comparison important?

A/B testing in machine learning allows teams to quickly experiment and iterate in order to improve and reach their business objectives. When it comes to leveraging foundation models, A/B testing can refer to evaluating how two models perform in production.

Foundation models are base models that are trained on inherent scale data and that act as a starting point for various downstream tasks or applications. A/B testing allows you to effectively assess the performance of different versions or variations of a foundation model. By comparing their effectiveness, you can determine which model performs better in terms of metrics such as accuracy, precision, or recall on your data and for your specific business use case. This approach also lets you systematically make improvements based on qualitative and quantitative metrics – allowing you to optimize the foundation model’s performance and enhance its capabilities over time.

However, effective A/B testing requires a platform that is not only able to provide sufficient metrics on assessing the performance of models, but one that also allows teams to share experiment findings and have a store of record. When models are evaluated, having a store of record of the model experiment is crucial so that a team can retrace the steps leading to the model selection and compare the next iteration of the model against the previous version.

Introducing Model Foundry

Foundation models are changing the landscape of AI – automating complex tasks such as data labeling and enrichment. With Model Foundry, teams have the ability to easily A/B test and compare a wide-range of open-source and third-party foundation models in a single platform.

You can test, compare, and evaluate models across various prompts and hyperparameters to confidently select the best model to perform pre-labeling or data enrichment tasks on your data. Selecting the best performing model on your data is crucial to continuously improving and scaling model performance in less time and at a lower cost. Regardless of your team’s AI maturity and business use case, you can experiment with foundation models in a no-code environment and continue to evaluate and iterate on model performance.

Access the world’s best AI models

Model Foundry will feature various public and private models computer vision (CV) and natural language processing (NLP) use cases, such as SAM, YOLOv8, OWL-ViT, and GPT, as well as popular models from Google, OpenAI, Databricks, and Anthropic​​.

Automatically gain insight into how a specific model performs on a subset of data. Rather than doing this through a Colab notebook, quickly generate pre-labels from a chosen model in a few clicks and understand how a model performs on the given task.

Save time with a streamlined A/B testing framework

To conduct a comprehensive analysis of a model, it helps to be able to compare the predictions of each model with the original ground truth labels side-by-side. A model run in Labelbox provides a versioned data snapshot of the data rows, annotations, and data splits for that given model run. At the end of each Model Foundry job, a model run will automatically be populated for your analysis.

As you continue iterating on your model and data, it is likely that you’ll end up with many model runs. The goal of comparing models is to measure and understand the marginal value of every machine learning iteration. Each model run is a versioned snapshot of an experiment that can be revisited and analyzed by your team at any given time. To make A/B testing even easier, Labelbox Model provides a model comparison view that allows you to visually compare the performance of two models as well as compare the two models with scalar and confusion metrics.

Dig into comprehensive evaluation metrics

In order to compare or ensure the effectiveness of a model, you need to understand how it has performed on the given task. Leveraging Labelbox’s Model, you can conduct a comprehensive analysis of a model’s performance with holistic metrics that evaluate its performance.

Dive into auto-generated quantitative model metrics, such as precision, recall, F-1 score, and more. To gain a deeper understanding of each model’s performance, it is important to analyze where the models are performing well and where they might be struggling. Model metrics are valuable in helping surface low-confident predictions, areas of agreement/disagreement between model predictions and ground truths, and can help your team analyze model performance, detect labeling mistakes, and find model errors. While most metrics are auto-generated by Labelbox, you can also update your own custom metrics depending on your specific use case.

Model comparison in practice

Let’s take a look at a real-world example – comparing GPT-4 and Claude – two powerful LLMs developed to date. In this experiment, we will be comparing how these two models perform on a custom text datasets and systematically evaluate and compare their zero-shot predictive accuracy and generative ability.

Dataset and problem setup

For this experiment, we obtained 100 data points from the Kaggle Wikipedia Movie Plots dataset, which provides detailed information on movie plots, genre, and more. We are interested in assessing the predictive capabilities of GPT-4 and Claude in determining the movie genre based on plot and also their ability to generate a suitable and concise summary.

To narrow the scope, we are only interested in the following categories: ‘comedy’, ‘animated’, ‘sci-fi’, ‘thriller’, ‘action’, ‘family’, ‘fantasy’, ‘horror’, ‘adventure’, and ‘drama’. We’ll also be evaluating models on their precision, recall, F1 scores, and confusion matrix.

Model and prompt setup

When working with these models, creating effective prompts is important and there are a variety of techniques available for prompt engineering. For the purpose of this experiment, we chose to use simple prompt templates that clearly describe the ML task and the expected output format, which is a structured JSON format. This made it easy for us to incorporate the prompts into our existing workflows. To ensure a fair comparison, we used the exact same prompts for both models during evaluation.

The prompt without examples:

For this movie plot description, describe plot_summary, or answer N/A if you are not confident.The plot summary should be short 1 sentence description. Classify movie genres by picking one or more of the options: [comedy, animated, sci-fi, thriller, action, family, fantasy, horror, adventure, drama]. 
Return the result as a json: {"plot_summary" : "<answer>", "movie_genres" : ["<prediction>"]} 

{insert movie plot}

The prompt above asks the LLM to provide one sentence summary to classify the movie genres and return the answer in a structured JSON format with the predicted text at the end. After specifying the prompt and running inference, we can automatically see the model outputs in a model run in Labelbox Model.

Findings and results

Once the model job is complete, you can visualize and evaluate the results of the ChatGPT on product categorization and summary in Labelbox Catalog and Model.

In a model run, you can evaluate model results for a comprehensive view of quantitative and qualitative metrics. In the dropdown, you can select two model runs of your choice for comparison – Labelbox automatically assigns each model run a different color so that it can be distinguished in metrics and visualizations. In the below image, you can see the GPT-4 model run in orange and the Claude model run in blue.

Model run comparison view in Labelbox Model

Quantitative comparison

In our evaluation, we found that both models demonstrate impressive out-of-the-box zero-shot performance. Claude scored higher on the overall F-1 score, whereas GPT-4 scored higher on all other overall metric scores.

Left: GPT-4 model run, Right: Claude model run

Auto-generated metrics provide a deeper understanding of each model’s performance, allowing for analysis on where the models performed well and where they might be struggling. By conducting an analysis by class, we can gain insight into specific areas where the models are successful and where they are falling short.

Metrics provide insight into the success and failures of the two language models

For example, although both models perform well in terms of recall for ‘sci-fi’ and ‘adventure’ genres for recall, they have a low precision score, indicating that both models are overly confident in assigning these labels to movie plots. As a result, only a small portion of sci-fi and adventure genre predictions correspond to the actual ground truth labels, contributing to the low precision score. This is also true for Claude’s performance in the fantasy genre.

On the other hand, both models are very good at classifying ‘comedy’ genres. The genre ‘animated’ and family had a very small sample size of only one data point each, which isn’t sufficient for meaningful analysis.

Qualitative comparison

Let’s now take a look at how GPT-4 and Claude summarize selected movie plots. Overall, both models performed well in being able to capture the movie plot in a single sentence, although they differed slightly in their level of consciousness and abstraction. Claude was able to generate shorter summaries, while GPT-4 was able to convey more captivating plot details.

Check out a few examples below:

Original Movie plot

Claude Summary

GPT-4 summary 

Jesse Owens, a promising black American runner, attends Ohio State University, despite facing racial discrimination and slurs from the white athletes. He attracts the attention of coach Larry Snyder, who believes Owens has enormous potential but needs work on his form and technique. When Snyder suggests he is good enough to compete at the 1936 Berlin Olympic Games, Owens is interested, but hesitant because of racial propaganda in Nazi Germany. The U.S. Olympic Committee is already considering boycotting the Olympics over Germany's discriminatory policies, only agreeing to participate when Nazi official Joseph Goebbels gives personal assurances they will allow foreign athletes of any race to compete, as well as promising to rein in their propaganda.

As Owens struggles to support his girlfriend Ruth and young daughter, he takes a job at a service station, upsetting Snyder when his job conflicts with his commitment to train. When Snyder learns Owens has a family to support, he gets him a no-show job, allowing him to focus on running. Owens goes on to break several records, including some of Snyder's, and begins a relationship with a woman whose attentions he attracts with his newfound fame. When Ruth threatens him with a lawsuit, Owens becomes distracted and loses a race to runner Eulace Peacock. Owens decides to return to Ruth to reconcile, convincing her to marry him. As the Olympics draw closer, the NAACP asks him not to go to Berlin for political reasons. Owens is conflicted, but Peacock urges him to compete to defy Nazi racial ideology.

In Berlin, Owens wins his first gold medal in the 100 m dash, but when he is brought by International Olympic Committee member Avery Brundage to receive congratulations from Adolf Hitler, he is told the Chancellor has left the stadium early to avoid traffic. Brundage warns Goebbels that Hitler must congratulate all winners or no winners, with Goebbels replying Hitler will not appear with "that". Owens next qualifies for the broad jump after German rival Luz Long unexpectedly gives him tips on his technique. Owens wins another medal, and Luz publicly shakes his hand and joins him for a victory lap, privately expressing concern about national politics. Owens wins his fourth and final medal in the 4 x 100 m relay, filling in for two Jewish American athletes cut by Brundage, who is convinced by Goebbels to do so to avoid a scandal over a business arrangement Brundage entered with the Nazis before the games. Director Leni Riefenstahl films the event against Goebbels' orders, then asks Owens to repeat his medal-winning broad jump to get a few more shots for her next film, Olympia.

Jesse Owens, a promising black sprinter, faces discrimination while attending Ohio State University in the 1930s. He qualifies for the 1936 Berlin Olympics, where he wins four gold medals, defeating Nazi propaganda about Aryan supremacy.

Jesse Owens, a talented black American runner, overcomes racial discrimination and personal struggles to win four gold medals at the 1936 Berlin Olympics, defying Nazi propaganda and racial ideology.

Bernard "Bernie" Webber, a crewman at the Coast Guard station in Chatham, Massachusetts, on Cape Cod, has fallen in love with a local girl, Miriam Pentinen. The two plan to marry on April 16, but in accordance with Coast Guard regulations, Bernie must seek permission from the station's commander, Daniel Cluff. On the day he is due to ask permission, the oil tanker SS Pendleton shears in half off the Chatham coast after getting caught in a nor'easter. With the majority of the station's crew already underway with the rescue of the similarly damaged SS Fort Mercer, Bernie is dispatched to pilot motor lifeboat CG 36500 to rescue the crew of Pendleton. Andrew Fitzgerald, Ervin Maske, and Richard P. Livesey volunteer to join Bernie on the rescue mission.

Meanwhile, the Pendleton's engineer, Ray Sybert, as the surviving senior officer, organizes the surviving seamen to steer the sinking stern of the tanker onto a submerged reef, where it can lodge until rescuers arrive.

When Miriam hears that Bernie is leading the rescue effort, she, like a number of townspeople, thinks this is a suicide mission. Few people in Chatham trust Cluff, since he is not from the area and does not know its treacherous weather. Miriam drives to the station, demanding that Cluff call Bernie back. Cluff refuses, and brusquely orders Miriam out.

Between the Chatham harbor and the open sea lies a bar, a series of shoals that are very dangerous even in good weather. Bernie must time bursts of his engine to ride each approaching wave before it breaks as he pilots CG 36500 across the bar. Although he makes it over the bar, he loses his compass.

Bernie steers CG 36500 to the stricken tanker. Although his boat's designated capacity is only 12 people, Bernie manages to rescue 32 crewmen. The stern of Pendleton begins sinking more rapidly during the rescue and goes down shortly after the last crewman comes aboard Bernie's boat. Relying on his knowledge of the coast and prevailing winds in place of his compass, Bernie steers CG 36500 toward home—a task made more difficult as Chatham loses power. Miriam and the other townspeople drive their cars to the pier and turn on their headlights to guide Bernie in.

The film shows photographs from the event that briefly document the aftermath of the rescue. Two months later, Bernie and Miriam marry; they stay together for 58 years until Bernie's death in 2009. Webber and his crew receive the Gold Lifesaving Medal.

A small Coast Guard crew conducts a dangerous rescue mission in a storm to save the lives of 32 crewmen from an oil tanker that has split in half.

In a dangerous rescue mission, Coast Guard crewman Bernie Webber and his team save 32 crewmen from an oil tanker split in half during a storm, using only their knowledge of the coast to navigate and return home.

Flamboyant television financial expert Lee Gates is in the midst of the latest edition of his show, Money Monster. Less than 24 hours earlier, IBIS Clear Capital's stock inexplicably cratered, apparently due to a glitch in a trading algorithm, costing investors $800 million. Lee planned to have IBIS CEO Walt Camby appear for an interview about the crash, but Camby unexpectedly left for a business trip to Geneva.

Midway through the show, a deliveryman wanders onto the set, pulls a gun and takes Lee hostage, forcing him to put on a vest laden with explosives. He is Kyle Budwell, who invested $60,000—his entire life savings—in IBIS after Lee endorsed the company on the show. Kyle was wiped out along with the other investors. Unless he gets some answers, he will blow up Lee before killing himself. Once police are notified, they discover that the receiver to the bomb's vest is located over Lee's kidney. The only way to destroy the receiver—and with it, Kyle's leverage—is to shoot Lee and hope he survives.

With the help of longtime director Patty Fenn, Lee tries to calm Kyle and find Camby for him, though Kyle is not satisfied when both Lee and IBIS chief communications officer Diane Lester offer to compensate him for his financial loss. He also is not satisfied by Diane's insistence that the algorithm is to blame. Diane is not satisfied by her own explanation, either, and defies colleagues to contact a programmer who created the algorithm, Won Joon. Reached in Seoul, Joon insists that an algorithm could not take such a large, lopsided position unless someone meddled with it.

Lee appeals to his TV viewers for help, seeking to recoup the lost investment, but is dejected by their response. New York City police find Kyle's pregnant girlfriend Molly and allow her to talk to Kyle through a video feed. When she learns that he lost everything, she viciously berates him before the police cut the feed. Lee, seemingly taking pity on Kyle, agrees to help his captor discover what went wrong.

Once Camby finally returns, Diane flips through his passport, discovering that he did not go to Geneva but to Johannesburg. With this clue, along with messages from Camby's phone, Patty and the Money Monster team contact a group of Icelandic hackers to seek the truth. After a police sniper takes a shot at Lee and misses, he and Kyle resolve to corner Camby at Federal Hall National Memorial, where Camby is headed according to Diane. They head out with one of the network's cameramen, Lenny, plus the police and a mob of fans and jeerers alike. Kyle accidentally shoots and wounds producer Ron Sprecher when Ron throws Lee a new earpiece. Kyle and Lee finally confront Camby with video evidence obtained by the hackers.

It turns out that Camby bribed a South African miners' union, planning to have IBIS make an $800 million investment in a platinum mine while the union was on strike. The strike lowered the mine's owners stock, allowing Camby to buy it at a low price. If Camby's plan had succeeded, IBIS would have generated a multibillion-dollar profit when work resumed at the mine and the stock of the mine's owner rose again. The gambit backfired when the union stayed on the picket line. Camby attempted to bribe the union leader, Moshe Mambo, in order to stop the strike, but Mambo refused and continued the strike, causing IBIS' stock to sink under the weight of its position in the flailing company.

Despite the evidence, Camby refuses to admit his swindle until Kyle takes the explosive vest off Lee and puts it on him. Camby admits to his wrongdoing to Kyle on live camera. Satisfied with the outcome, Kyle throws the detonator away, then much to Lee's dismay gets fatally shot by the police. In the aftermath, the SEC announces that IBIS will be put under investigation, while Camby is charged with violations of the Foreign Corrupt Practices Act.

A financial TV host is taken hostage while on air by a man whose life savings were lost when the stock of a company the host had endorsed crashed.

A disgruntled investor takes a financial TV host hostage on air, demanding the truth behind a recent stock market crash, ultimately unraveling a corporate conspiracy.

When the Little Red-Haired Girl moves into his neighborhood, Charlie Brown becomes infatuated with her, though worries his long-running streak of failures will prevent her from noticing him. After Lucy tells him he should try being more confident, Charlie Brown decides to embark upon a series of new activities in hope of finding one that will get the Little Red-Haired Girl to notice him. His first attempt is to participate in the school's talent show with a magic act, helped by Snoopy and Woodstock. However, when Sally's act goes wrong, Charlie Brown sacrifices his time for her, rescues his sister from being humiliated, and is humiliated himself in return. Attempting to impress the Little Red-Haired Girl with his dance skills, Charlie Brown signs up for the school dance and gets Snoopy to teach him all his best moves. At the dance, Charlie Brown attracts praise for his skills but slips and sets off the sprinkler system, causing the dance to be cut short and all the other students to look down upon him once more.

Charlie Brown is partnered with the Little Red-Haired Girl to write a book report. At first, he is excited to have a chance to be with her, but she is called away for a week to deal with a family illness, leaving Charlie Brown to write the report all by himself. Hoping to impress both the Little Red-Haired Girl and his teacher, Charlie Brown writes his report on the collegiate-level novel War and Peace. At the same time, Charlie Brown finds he is the only student to get a perfect score on a standardized test. His friends and the other students congratulate him, and his popularity begins to climb. When he goes to accept a medal at a school assembly, however, he learns the test papers are accidentally mixed up and the perfect score actually belongs to Peppermint Patty; Charlie Brown declines the medal, losing all his new-found popularity. His book report is destroyed by a Red Baron model plane, and he admits to the Little Red-Haired Girl he has caused them to both fail the assignment.

Before leaving school for the summer, Charlie Brown is surprised when the Little Red-Haired Girl chooses him for a pen pal. Linus convinces Charlie Brown he needs to tell the Little Red-Haired Girl how he feels about her before she leaves for the summer. Racing to her house, he discovers she is about to leave on a bus for summer camp. He tries to chase the bus but is prevented from reaching it. Just as he is about to give up, thinking the whole world is against him, Charlie Brown sees a kite fall from the Kite-Eating Tree. The string becomes entangled around his waist and sails away with him. Amazed to see Charlie Brown flying a kite, his friends follow.

Upon reaching the bus, Charlie Brown finally asks the Little Red-Haired Girl why she has chosen him in spite of his failures. The Little Red-Haired Girl explains she admires his selflessness and his determination and praises him as an honest, caring, and compassionate person. The two promise to write to one another; the other children congratulate him as a true friend and carry him off.

In a subplot, after finding a typewriter in the school dumpster, Snoopy writes a novel about the World War I Flying Ace, trying to save Fifi from the Red Baron with Woodstock and his friends' help, using the key events and situations surrounding Charlie Brown as inspiration to develop his story. He acts out his adventure physically, pulling himself across a line of lights, and, imagining it as a rope across a broken bridge, he comes across Charlie Brown and the gang several times along the way. Snoopy defeats the Red Baron and rescues Fifi from an airplane. When Lucy finishes reading, she calls it the dumbest story she has ever read, so Snoopy throws the typewriter at her in retaliation and kisses her nose causing her to run away in disgust yelling that she has "dog germs".

Charlie Brown tries to impress the Little Red-Haired Girl in various ways, but fails; however, she admires him for his kindness and they become pen pals.

Charlie Brown embarks on various activities to gain the attention of the Little Red-Haired Girl, eventually winning her admiration through his selflessness and determination.

For additional learning and to view another model comparison example, check out our recent blog post on GPT-4 vs PaLM.


In conclusion, the availability of off-the-shelf foundation models has introduced breakthroughs in AI development by reducing the barrier to kick-start model development. Teams of any AI maturity can speed up their development process by selecting the most suitable foundation model for their specific use case. The key to maximizing iteration cycles lies in selecting the right algorithm and conducting a comprehensive model comparison process to assess the model’s alignment with business goals. While confidently evaluating pre-trained models can be challenging, Labelbox Model and the upcoming Model Foundry provide ML teams with the tools to effectively compare, evaluate, and leverage foundation models for AI development.

Labelbox recently announced early access to Model Foundry – the easiest place to build model prediction workflows based on your use case, integrate model results with your labeling workflow, and compare and evaluate model results.

Join the waitlist today and get notified when the beta becomes available for use.