Manu Sharma | April 19, 2023

GPT-4 vs PaLM: Assessing the predictive and generative performance of LLMs


Large language models (LLMs), such as GPT-4 from OpenAI, Bard, and Llama, have demonstrated remarkable abilities to engage in creative and compelling conversations, distill and summarize information from contextual cues, and deliver exceptional zero-shot performance on a wide range of predictive and generative tasks involving previously unseen data.

If you have experimented with LLMs, you know that while they are undoubtedly powerful, they also have limitations. One primary drawback is their inability to stay current with the latest information, resulting in responses that may be outdated or incomplete. Additionally, the accuracy and appropriateness of model outputs can vary based on the prompt given. In some cases, LLMs can even “hallucinate” answers or produce results that are not relevant to the query.

The above are just some of the reasons why businesses might hesitate to trust LLMs with automating internal workflows or customer-facing functions. While many people have encountered LLMs through the chat interfaces of OpenAI and Bard or through demos of LLM applications, such experiences don’t address critical questions such as:

  • How will LLMs perform on new data and help solve my unique use case?
  • How can I quickly experiment and compare different LLMs on my own data at scale in order to test which is the best for my use case?

In this blog post, we aim to put GPT-4 and PaLM, two of the largest and most powerful LLMs developed to date, to the test. We will provide a step-by-step walkthrough on how to apply GPT-4 and PaLM to a custom text dataset and systematically evaluate and compare their zero-shot predictive accuracy and generative ability. By doing so, we hope to help you confidently assess the potential and limitations of LLMs, as well as inspire new use cases.

Experiment setup

Dataset and problem setup

We sampled 100 data points from the Wikipedia Movie Plots dataset from Kaggle. It contains a wealth of information, including long-form movie plots, genres, and other relevant metadata. For this use case, we are interested in assessing an LLM’s ability to predict movie genres from movie plots, as well as its ability to generate an appropriate and succinct summary.

We filtered the original dataset to include only the following categories: 'comedy', 'animated', 'sci-fi', 'thriller', 'action', 'family', 'fantasy', 'horror', 'adventure', 'drama'. Since a movie can fall into multiple genres, the task is a multi-label classification problem. The models will be evaluated on their precision, recall, F1 scores, and confusion matrices.
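A filtering and sampling step along these lines can be sketched with pandas. The `Genre` and `Plot` column names follow the Kaggle dataset; the exact sampling logic below is an assumption, not the code we ran:

```python
import pandas as pd

GENRES = {'comedy', 'animated', 'sci-fi', 'thriller', 'action',
          'family', 'fantasy', 'horror', 'adventure', 'drama'}

def filter_and_sample(df, n=100, seed=42):
    """Keep rows whose genres all fall in GENRES, then sample n rows."""
    df = df.copy()
    # The Kaggle dataset stores genres as free text, e.g. "comedy, drama".
    df["genre_list"] = df["Genre"].str.lower().str.split(",").apply(
        lambda gs: [g.strip() for g in gs] if isinstance(gs, list) else []
    )
    mask = df["genre_list"].apply(lambda gs: len(gs) > 0 and set(gs) <= GENRES)
    return df[mask].sample(n=min(n, len(df[mask])), random_state=seed)
```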

To evaluate summary generation, we tasked both LLMs with generating a short summary and manually read through them to evaluate them on three dimensions: clarity, accuracy, and conciseness.

Model and prompt setup

We chose to compare the GPT-4 and PaLM models as they are among the most powerful generative models currently available and have APIs that support bulk inference. Creating effective prompts for large language models is an art, with various prompt engineering techniques available. For the purposes of this blog, we opted to use simple prompt templates that clearly describe the ML task and the return format (a structured JSON format). This will also allow us to easily integrate the prompts into downstream workflows. To create a fair comparison, we used identical prompts for both models during our evaluation.

The prompt without examples:

For this movie plot description, describe plot_summary, or answer N/A if you are not confident. The plot summary should be short 1 sentence description. Classify movie genres by picking one or more of the options: [comedy, animated, sci-fi, thriller, action, family, fantasy, horror, adventure, drama]. 
Return the result as a json: {"plot_summary" : "<answer>", "movie_genres" : ["<prediction>"]} 

{insert movie plot}

In this prompt, we asked the LLM to provide a short one-sentence summary and to classify the movie genres. We requested that the output be returned in a structured JSON format, with the actual text asset inserted at the end.

We integrated each text data asset into the prompt template above and leveraged GPT-4 and PaLM models for inference. We then uploaded the resulting model outputs into a model run in Labelbox Model.
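In code, the inference loop might look like the sketch below. `call_llm` is a stub standing in for the actual GPT-4 or PaLM API call (e.g. OpenAI's chat completion endpoint or Vertex AI's text model), so that the template-filling and JSON-parsing logic is self-contained and runnable:

```python
import json

# The prompt from the post, with {plot} as the fill-in slot
# (JSON braces escaped for str.format).
PROMPT_TEMPLATE = (
    "For this movie plot description, describe plot_summary, or answer N/A if "
    "you are not confident. The plot summary should be a short 1 sentence "
    "description. Classify movie genres by picking one or more of the options: "
    "[comedy, animated, sci-fi, thriller, action, family, fantasy, horror, "
    "adventure, drama].\n"
    'Return the result as a json: {{"plot_summary" : "<answer>", '
    '"movie_genres" : ["<prediction>"]}}\n\n'
    "{plot}"
)

def call_llm(prompt):
    """Placeholder: swap in the real GPT-4 / PaLM client call here."""
    return '{"plot_summary": "N/A", "movie_genres": []}'

def parse_response(raw):
    """Parse the model's JSON reply, falling back to N/A on malformed output."""
    try:
        data = json.loads(raw)
        return data.get("plot_summary", "N/A"), data.get("movie_genres", [])
    except (json.JSONDecodeError, AttributeError):
        return "N/A", []

def run_inference(plots):
    return [parse_response(call_llm(PROMPT_TEMPLATE.format(plot=p)))
            for p in plots]
```

The fallback in `parse_response` matters in practice: both models occasionally return malformed JSON, and a bulk job shouldn't crash on a single bad reply.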

Findings and results

Visualize model predictions and metrics

We can now evaluate the results of the GPT-4 and PaLM models. To conduct a comprehensive analysis of a model, it is helpful to compare the predictions of each model with the original ground truth labels side-by-side. It is also essential to conduct a holistic metric analysis that evaluates the models’ overall performance on the task. To this end, we uploaded both models’ predictions and original ground truth labels to a Labelbox model run so that we could easily visualize the predictions and see auto-generated quantitative model metrics.

A populated model run in Labelbox Model with predictions and original ground truth labels for comparison. 

PaLM offers a unique feature that generates multiple safety attribute scores based on how likely the generated content is to be hateful, violent, or sexual, or to contain political or medical topics. This is helpful for customizing content moderation and for understanding the distribution of response attributes. By clicking on an asset to examine its prediction and metrics, you can easily view these scores.

PaLM generates multiple safety attribute scores based on how likely the generated content is to be hateful, violent, or sexual, or to contain political or medical topics.
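If you want to act on these scores programmatically, a simple thresholding pass is enough. The sketch below assumes the safety attributes arrive as a category-to-score dictionary, roughly the shape the PaLM API returns; the category names and threshold are illustrative:

```python
def flag_unsafe(safety_scores, threshold=0.5):
    """Return the safety categories whose score exceeds the threshold.

    `safety_scores` is assumed to be a dict mapping a category name
    (e.g. "Violent") to a probability-like score in [0, 1].
    """
    return sorted(cat for cat, score in safety_scores.items()
                  if score > threshold)
```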

Quantitative comparison

Multi-label classification is a challenging task, especially in the context of movie genre classification, where the models must make zero-shot predictions on how movies have been classified by the dataset creator. In addition, the models must accurately predict only the relevant genres and exclude any irrelevant ones, making the classification task even more complex.

In our evaluation, we found that both models demonstrate impressive out-of-the-box zero-shot performance. PaLM scored higher on the overall precision score, whereas GPT-4 scored higher on overall recall scores.
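Overall precision and recall for a multi-label task like this can be reproduced with scikit-learn. A minimal sketch, assuming genre labels are kept as lists of strings per movie:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_score, recall_score, f1_score

GENRES = ['comedy', 'animated', 'sci-fi', 'thriller', 'action',
          'family', 'fantasy', 'horror', 'adventure', 'drama']

def overall_metrics(y_true, y_pred):
    """Micro-averaged precision/recall/F1 over per-movie genre lists."""
    mlb = MultiLabelBinarizer(classes=GENRES)
    t = mlb.fit_transform(y_true)  # binary indicator matrix, one column per genre
    p = mlb.transform(y_pred)
    return {
        "precision": precision_score(t, p, average="micro", zero_division=0),
        "recall": recall_score(t, p, average="micro", zero_division=0),
        "f1": f1_score(t, p, average="micro", zero_division=0),
    }
```

Swapping `average="micro"` for `average=None` yields the per-genre breakdown discussed below.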

Left: PaLM model, Right: GPT-4 model

To gain a deeper understanding of each model’s performance, it is important to analyze where the models are performing well and where they might be struggling. By conducting an analysis by class, we can gain insight into specific areas where the models are successful and where they are falling short.

The graphs provide valuable insight into the success and failures of the two language models.

For example, although both models perform well in terms of recall for ‘sci-fi’ and ‘drama’ genres, they have a low precision score, indicating that both models are overly confident in assigning these labels to movie plots. As a result, only a small portion of 'sci-fi' genre predictions correspond to the actual ground truth labels, contributing to the low precision score. This is also true for GPT-4’s performance on the ‘adventure’ genre.

On the other hand, both models are very good at classifying ‘fantasy’ and ‘comedy’ genres. The genres ‘animated’ and ‘family’ had a very small sample size of only one data point each, which isn’t sufficient for meaningful analysis.

Qualitative comparison

Let’s now take a look at how GPT-4 and PaLM summarize selected movie plots. Overall, both models performed well in being able to capture the movie plot in a single sentence, although they differed slightly in their level of conciseness and abstraction. PaLM was able to generate shorter summaries, while GPT-4 was able to convey more captivating plot details.

Check out a few examples below. For each, the original movie plot is followed by the PaLM summary and the GPT-4 summary.

Original movie plot: Vincent Chase, who separated from his wife after nine days of marriage, wants to do something new in his career. He calls his former agent-turned-studio head Ari Gold, who offers Vince a leading role in his first studio production. Vince says he will only star in it if he directs. Eight months later, Hyde, the directorial debut by Vince, is in the middle of post-production and is over budget by $15 million. Needing more money, Vince asks Ari if they can get an additional $10 million to complete the movie. Reluctant, but wanting to please him, Ari flies to Texas to meet the co-financiers of the movie, Larsen McCredle and his son Travis, to get them to give the money. Hesitant, Larsen sends Travis with Ari back to Los Angeles to see a cut of the film at Vince's private screening.

Vince has second thoughts on the rough cut and cancels the screening out of fear no one will enjoy it. He later gives Ari and Travis a copy of the movie for their viewing. Ari and Travis enjoy the movie but Travis requests that Ari cut the scenes with Vince's brother, Johnny "Drama" Chase, out of the movie or he will not give the film the extra money. With Ari giving him no answer, Travis puts the post-production on hold. Vince and Eric "E" Murphy visit Travis to come up with a solution, but this results in Travis saying he did not like Vince's performance either and wants the entire movie to be redone with a new lead and director. In addition, studio CEO John Ellis removes Ari from the production.

The boys soon realize the reason Travis is angry over the movie is because Vince is spending time with Emily Ratajkowski, with whom Travis is infatuated. Learning of this, Ari interrupts Ellis in a boardroom meeting with Larsen and Travis with this information. Surprised, Larsen says he will stay behind the movie but mandates the condition that Ari is no longer working for the studio because, although he agreed his son was being juvenile, Ari had disrespected his son. During this time, the boys are in the hospital as E's ex-fiancée Sloan goes into labor. Sloan gives birth to a girl, Ryan Murphy, and the boys celebrate.

Ari arrives at the hospital and also announces that he resigned and invested his severance to complete the movie. Hyde becomes a box office hit, grossing over $450 million worldwide. At the Golden Globe Awards, Drama ends up winning the award for Best Supporting Actor. Stunned, Drama goes to accept the award and is silent for a moment before yelling his trademark Viking Quest phrase, "VICTORY!!!"

In a mid-credits scene, Ari's former assistant Lloyd has his wedding and Ari gives him away. When the group gets together for a picture, Billy Walsh suggests that they should make a film (or TV show) about the lives of Vince and the gang.

PaLM summary: Vince Chase, who separated from his wife after nine days of marriage, wants to do something new in his career.

GPT-4 summary: A Hollywood actor directs his first movie, which goes over budget, leading to conflicts with financiers and a struggle to complete the film.

Original movie plot: Jesse Owens, a promising black American runner, attends Ohio State University, despite facing racial discrimination and slurs from the white athletes. He attracts the attention of coach Larry Snyder, who believes Owens has enormous potential but needs work on his form and technique. When Snyder suggests he is good enough to compete at the 1936 Berlin Olympic Games, Owens is interested, but hesitant because of racial propaganda in Nazi Germany. The U.S. Olympic Committee is already considering boycotting the Olympics over Germany's discriminatory policies, only agreeing to participate when Nazi official Joseph Goebbels gives personal assurances they will allow foreign athletes of any race to compete, as well as promising to rein in their propaganda.

As Owens struggles to support his girlfriend Ruth and young daughter, he takes a job at a service station, upsetting Snyder when his job conflicts with his commitment to train. When Snyder learns Owens has a family to support, he gets him a no-show job, allowing him to focus on running. Owens goes on to break several records, including some of Snyder's, and begins a relationship with a woman whose attentions he attracts with his newfound fame. When Ruth threatens him with a lawsuit, Owens becomes distracted and loses a race to runner Eulace Peacock. Owens decides to return to Ruth to reconcile, convincing her to marry him. As the Olympics draw closer, the NAACP asks him not to go to Berlin for political reasons. Owens is conflicted, but Peacock urges him to compete to defy Nazi racial ideology.

In Berlin, Owens wins his first gold medal in the 100 m dash, but when he is brought by International Olympic Committee member Avery Brundage to receive congratulations from Adolf Hitler, he is told the Chancellor has left the stadium early to avoid traffic. Brundage warns Goebbels that Hitler must congratulate all winners or no winners, with Goebbels replying Hitler will not appear with "that". Owens next qualifies for the broad jump after German rival Luz Long unexpectedly gives him tips on his technique. Owens wins another medal, and Luz publicly shakes his hand and joins him for a victory lap, privately expressing concern about national politics. Owens wins his fourth and final medal in the 4 x 100 m relay, filling in for two Jewish American athletes cut by Brundage, who is convinced by Goebbels to do so to avoid a scandal over a business arrangement Brundage entered with the Nazis before the games. Director Leni Riefenstahl films the event against Goebbels' orders, then asks Owens to repeat his medal-winning broad jump to get a few more shots for her next film, Olympia.

PaLM summary: Jesse Owens, a promising black American runner, attends Ohio State University, despite facing racial discrimination and slurs from the white athletes.

GPT-4 summary: Jesse Owens, a black American runner, overcomes racial discrimination and personal struggles to win four gold medals at the 1936 Berlin Olympic Games, defying Nazi racial ideology.

Original movie plot: The story follows Michelle Darnell (Melissa McCarthy), a titan of industry who is sent to prison for insider trading, denounced by her former lover, Renault (Peter Dinklage), who still holds a heavy grudge towards her for their breakup, after getting a promotion a few years ago. After doing her time, Michelle emerges, ready to rebrand herself as America's latest sweetheart, but not everyone she steamrolled is so quick to forgive and forget. With nowhere to go and no one to scam, Michelle is forced to move in with former assistant Claire (Kristen Bell) and her young daughter, Rachel (Ella Anderson). Now at her lowest point, Michelle wastes no time in devising a winner-take-all plan to rebuild her empire.

PaLM summary: A titan of industry is sent to prison for insider trading and must rebuild her empire when she is released.

GPT-4 summary: A former business tycoon, Michelle Darnell, tries to rebuild her empire after serving time in prison for insider trading.

Original movie plot: Toula Portokalos-Miller's (Nia Vardalos) life is in shambles. Her travel agency and the family dry cleaners have closed due to the recession. The only business still open is the family restaurant that her father, Gus (Michael Constantine), still runs. Her husband, Ian (John Corbett), is the principal at their teen-aged daughter Paris's (Elena Kampouris) high school. Paris, who is applying to college, feels smothered by her close-knit clan, who constantly interfere in her life. Desperate for independence and privacy, she applies to schools across the country. Ian and Toula's marriage has become strained due to Toula's obsessive need to be involved in Paris's life and to "fix" whatever goes wrong in her family.

Meanwhile, Gus has convinced himself that he is directly descended from Alexander the Great and wants to write to an online ancestry site for confirmation. While sorting through his records, he discovers that his and Maria's (Lanie Kazan) marriage certificate was never signed by the priest, technically invalidating their union. His current priest refuses to sign it but agrees to perform a new ceremony. Gus insists that he and Maria must marry again after fifty years together, but Maria wants Gus to propose properly. Gus refuses, infuriating Maria, who refuses to go through with the ceremony. Meanwhile, when Toula and Ian are on a date night to rekindle their romance, their family catches them kissing in their car outside their house. After Gus lands in the hospital and Maria refuses to go, saying she is not his wife, Gus pleas for her to marry him again. This time she accepts.

Maria wants the wedding she never had and hires a wedding planner who quits after the rowdy family's choices become too outlandish. The whole family, including Ian's parents, Rodney and Harriet (Bruce Gray and Fiona Reid), and Angelo's business partner, Patrick, pitch in to make the wedding happen. Nick urges Angelo to tell his parents, Voula (Andrea Martin) and Taki (Gerry Mendicino), that Patrick is also Angelo's romantic partner. Gus's estranged brother, Panos (Mark Margolis) arrives from Greece as a surprise.

Paris has been accepted to Northwestern University in Chicago and NYU in New York City. She chooses Northwestern to please her mother, but Paris's great-grandmother (Bess Meisler) convinces her she should go to New York. Paris asks Bennett (Alex Wolff), a boy she has a crush on, to the prom. He is also Greek with an equally crazy Greek family. Prom is the same night as the wedding. Toula tells Paris she can go to the prom if she attends the reception later. En route to the church, Gus, Panos, and Taki arrive drunk after many shots of ouzo. Maria storms off to the vestry after seeing Gus acting foolishly, feeling he is not taking the wedding seriously. Panos tells Maria that Gus had confided to him his love for Maria, and the ceremony continues. Watching as Gus and Maria recite their vows, Ian and Toula privately renew theirs. At the prom, Paris and Bennett share their first kiss while slow-dancing.

At the wedding reception, Gus reads a letter from the ancestry site verifying that he is a descendant of Alexander the Great. Ian, however, realizes that Toula forged the letter to make her father happy. The movie ends with the entire family dropping Paris off at her college dorm in New York.

PaLM summary: The Portokalos family prepares for the wedding of their daughter, Paris, and the renewal of their own vows.

GPT-4 summary: A Greek family comes together for a wedding while dealing with personal struggles, including a daughter's desire for independence and a couple's strained marriage.


Now, let’s consider the performance of GPT-4 and PaLM in a real-world scenario. If you want to implement an automated workflow for tagging and summarizing text assets, such as the movie plots in this case, both models could provide satisfactory results without any additional fine-tuning. However, the models' zero-shot predictions may not exactly match your requirements. For instance, both models are overly confident in predicting the ‘sci-fi’ genre, leading to deviations from the original dataset’s labels. To address this, you could either invest in further prompt engineering or fine-tune the models on several hundred examples of (movie plot, genre) pairs to teach them to classify and summarize texts according to your preferences.
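One lightweight middle ground between prompt tweaks and full fine-tuning is few-shot prompting: prepending a handful of labeled (movie plot, genre) pairs so the model can imitate the dataset's labeling conventions. A sketch, with the formatting choices being our own assumptions:

```python
import json

def build_few_shot_prompt(instruction, examples, plot):
    """Prepend labeled examples to the base instruction.

    `examples` is a list of (plot_text, genre_list) tuples drawn from the
    labeled dataset; `plot` is the new text to classify.
    """
    parts = [instruction, ""]
    for ex_plot, ex_genres in examples:
        parts.append(f"Plot: {ex_plot}")
        parts.append(f'Answer: {json.dumps({"movie_genres": ex_genres})}')
        parts.append("")
    parts.append(f"Plot: {plot}")
    parts.append("Answer:")
    return "\n".join(parts)
```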

In conclusion, both models perform well on multi-label classification and summarization tasks with zero-shot learning. PaLM outperforms GPT-4 in overall precision, while GPT-4 performs better in overall recall. Both models show similar success and failure patterns for certain movie genres. In terms of summarization, PaLM tends to produce shorter summaries, while GPT-4 includes more engaging plot details. A feature worthy of note is PaLM’s ability to attach safety scores to its generated responses, which can be greatly helpful for content moderation use cases.

Learn more about how to visualize and evaluate model performance in Labelbox Model.