How to automate labeling for product categorization with LLMs

This is the second post in a blog series by Karen Yang, Senior AI PM, showcasing how Model Foundry brings the power of foundation models into Labelbox.

Overview

Product tagging and categorization are essential components for the modern shopping experience. It powers intuitive product search results, personalized recommendations, product exploration, marketing campaigns, and is an ad engine.

Product categorization can be a labor-intensive process, as the tags and categories are usually created manually or crowdsourced by questionnaires from sellers and then moved through a review process. A popular e-commerce website can easily have tens of thousands of product listings with hundreds of categories to choose among. Reading over each of the products’ descriptions and deciding on the correct category can be a daunting task.

In recent years, large language models (LLMs) have become very powerful at common sense reasoning, contextual understanding, and information processing. They often have a strong zero or few shot performance on various use cases. In the case of product categorization, many LLMs are exposed to large amounts of similar data during their web-scale training process. Therefore, they generalize well for understanding and extracting product features, target audiences, and uses.

In this blog post, we will walk through how to use LLMs to instantly categorize products and provide relevant product summaries from their HTML text descriptions. By leveraging LLMs, product categorization can be largely automated with high accuracy, allowing subject matter experts to focus their attention on model evaluation rather than reading every single sentence.

Experiment setup

For this example, we took samples of sixteen product categories from the Etsy retail products dataset on Kaggle. It contains a wealth of information, including long-form product descriptions, categories, reviews, sellers, and other relevant information. For this use case, we are only interested in assessing an LLM’s ability to predict the product categories from the long description, as well as their ability to generate a summary (or name) to highlight the product’s features.

The 16 classes are:

["Toys & Games < Games & Puzzles",
'Art & Collectibles < Photography',
'Weddings < Decorations',
'Weddings < Gifts & Mementos',
'Art & Collectibles < Painting',
'Toys & Games < Sports & Outdoor Recreation',
"Shoes < Women's Shoes",
'Art & Collectibles < Drawing & Illustration',
'Accessories < Hair Accessories',
'Toys & Games < Toys',
'Accessories < Baby Accessories',
'Pet Supplies < Urns & Memorials',
'Art & Collectibles < Prints',
"Shoes < Girls' Shoes",
"Shoes < Unisex Kids' Shoes",
'Weddings < Accessories']

Some of the categories are within the same high-level category group and thus are semantically similar to each other. Appropriate classification requires contextual understanding rather than simple keyword searches to get good classification results.

Model and prompt set up

With Labelbox’s soon-to-be released Model Foundry, you will have access to a range of popular models, including third-party APIs and open source models. You can browse available models and read more about their intended uses and feasible ontology here.

In this experiment, we will use ChatGPT (gpt-3.5-turbo) as a demonstration, since its prompt-based interface makes it flexible to describe our problem and preference, and it can process both the classification and the summary generation tasks together.

You can leverage foundation model predictions on product classifications with Labelbox's Model Foundry.

Creating effective prompts for large language models that work well for your use cases is an art. Model Foundry makes it easy to generate a prompt template that works out of the box. All you need to do is define the ontology and optionally provide some instructions and examples – similar to how you would set up a labeling project. As you can see in the screenshot above, you first define the ontology for this model job. In this example, we asked the model to do a radio classification of "ecommerce_product_categories", with the sixteen classes as options, and we also asked the model to provide a summary, formulated as a free text question.

Once you have selected your ontology, a prompt template will automatically be populated, as seen in the ‘Prompt’ text box. This auto-configured prompt template explains your goal to the model, and asks the model to return it in a structured format so that it can easily be parsed and integrated into a downstream workflow. If you want to run some prompt engineering and add additional context, you can freely modify the template.

Another part of the template is the optional ‘Examples’ section. You can add a few example pairs of product descriptions with the expected answer, to inform the models what your preferences are. Studies have shown that LLMs can improve performance with this kind of few-shot in-context learning.

Lastly, you can change the hyperparameters of the chosen model if you don’t want to use the default ones. For example, with ChatGPT, you can set temperature, max_tokens, top_p, etc.

You can generate a preview on a small sample of your selected data. In the image above, you can see some of the generated previews. This allows you to understand whether the model configuration meets expectations on a small sample of data before bulk processing. Based on the preview results, you can go back and modify the model parameters and prompts to your satisfaction. Once you are happy with the model configuration, you can submit all the data rows for inference.

Results

Visualize model predictions and send to project as pre-labels

Once the model job is done, you can visualize and evaluate the results of the ChatGPT on product categorization and summary in Labelbox Catalog and Model.

ChatGPT results shown in Catalog and ready to be sent to project as pre-labels.

Evaluating model results

To conduct an analysis of a model, it is helpful to compare the predictions of each model with some known ground truth labels side-by-side. It is also essential to conduct a holistic metric analysis that evaluates the model's overall performance on the task. Labelbox’s model run page gives you such a comprehensive view. At the end of each Model Foundry job, a model run will be automatically populated for your analysis.

For example, say we have created and labeled a small set of known hard cases as the golden evaluation ground truth labels to test model performance. We can upload the ground truth labels to the chatGPT run 1 model run we just created via Model Foundry. Now we are ready to evaluate the qualitative and quantitative performance of ChatGPT on our task.

Quantitative evaluation

Let’s take a look at the metrics tab first. Here, we can quickly see where the model got things wrong. For example, the model seems to have most of its mistakes in the Shoes category. It has mistaken products from Shoes < Unisex Kids’ Shoes as an Accessories < Baby Accessories product five times, and Girls’ Shoes to Baby accessories three times. It has twice mistaken a product in Wedding < Accessories to be in Hair Accessories.

Confusion matrix that can be found in Labelbox Model's metrics tab.

The model run metrics page also auto-computes the precision, recall, and F1 scores for your model’s given ground truth labels, making it easy to evaluate and compare the quantitative performance of your model. In this experiment, we can see that the model has a 80%+ recall score.

Meanwhile, from the confusion matrix, we can see the model tends to struggle in classes with similar meanings. To dig in further, we can switch to a qualitative view to read the actual places where the mistakes happen, and understand why the model made the mistakes.

Qualitative evaluation

In Model, you can filter the data rows based on metrics. Instead of reading through every data row, you can hone in on just the problematic ones. For example, to understand the false positives for the class Shoes < Girls’ Shoes, you can quickly build and use a metrics filter in Labelbox Model to filter for exactly that, and it will automatically pull out the few data points that meet that criteria.

You can filter on varoius metrics within a model run in Labelbox.

With the information we have now, it’s easy to see why the model is confused: The product descriptions fit into both categories.

Here is one example: the ground truth label (taken from the original Kaggle dataset) categorizes the product as Girls’ shoes, and the model categorizes it as a Baby accessory — and both are technically accurate. In this case, what should we do if we want to improve the model further? We can clarify the model’s criteria by creating more specific instructions and examples, change the category terminology and domain so that they overlap less, or update the ontology so that one product can be listed in multiple categories.

Click on a data row to see a more detailed view on the model's performance.

Now let's take a look at whether the model is doing a good job summarizing the product descriptions into short phrases. These descriptions can be used for naming or search relevance. ChatGPT can generate succinct and accurate short product descriptions.

Original Product description	ChatGPT summary
A bespoke, hand-drawn portrait of your beloved pet, captured in graphite on 4x4inch heavy cartridge paper. The perfect gift for any pet owner or animal lover. Item Description My graphite portraits are created using a high quality, graphite and sealant, this means the lines remain saturated and crisp (when stored and displayed as recommended). Each piece is captured on quality 220gsm, heavy cartridge paper, this offers a smooth surface which allows me to capture the little details that make your pet uniquely them. What You Will Receive A high quality, custom drawing of your precious pet illustrated using the photos provided. My 4x4inch graphite portraits feature the head and neck of your pet only. If you desire a full body illustration please contact me in advance via messenger or Instagram and we can discuss options…	Custom pet portrait in graphite on 4x4 inch heavy cartridge paper
Listing is for an electronic pdf file of plans only - not actual dollhouse This listing is for a pdf file American Girl Doll or 18 inch doll House plans. These plans are for a 4 room version. This would be built as one piece instead of four seperate pieces as in our one room module plans. We will provide email support for questions when building the dollhouse. Finished dollhouse measures 4 feet 9 inches tall, 5 feet 3 inches wide and 2 feet deep. Each lower room is 2'-6" wide x 2' tall x 2' deep. Upper room varies in height…	American Girl Dollhouse Plans
This nursery tent is made for little boys and girls looking for adventure inside or out. Gray star exterior and stripes turquoise play mat brings freshness to every room. Arrange a bedtime story, or simply cuddle together inside this turquoise blue teepee for a lifetime of cherished childhood memories…	Gray star teepee with turquoise play mat and pillows
Bridesmaid tote bags , bridesmaid gifts , tote bag , beach bag , bachelorette party gift ,wedding bag , maid of honor I make 100% Jute Burlap Tote Bags. Best quality and affordable price for any occasion. 1. Bag Size:- 15 Inches Width and 13.5 Inches Height and 6 Inches Gusset (Depth). - 40 centimetre Soft Cotton Padded Handles 2. Bag Fabric:- 100% Jute Burlap Fabirc …	Bridesmaid tote bags

Conclusion

We’ve walked through the process of using a language model (ChatGPT) to generate predictions on product categorization and summary, and evaluate them in depth. The entire process, from problem definition to generating pre-labels, and running evaluation took less than 10 minutes. In our analysis, we see that the model gave largely satisfactory results in a zero-shot setting. This shows the potential of applying large language models to automate the bulk part product categorization tasks.

One of the most important components of building a productive model automation workflow is a quick iteration cycle for improving model prompts and configurations and evaluating the model results. With Labelbox Model, we can gain a more transparent and holistic view of the model performance. By zooming into the classes and data rows where the model failed, we can quickly understand that most of the model’s mistakes were due to the ambiguity of the category definition or because a product fits into multiple categories.

From there, subject matter experts can further improve and guide the model prediction process, by adding more examples in the prompts, iterating on the ontology and instructions, or fine-tuning the model if the previous efforts are not enough. With a few iterations, they can improve the model to be more accurate and aligned with their preferences in this use case.

Labelbox recently announced early access to Model Foundry – the easiest place to build model prediction workflows based on your use case, integrate model results with your labeling workflow, and compare and evaluate model results.

Join the waitlist and stay tuned for the next blog on how to use Model Foundry to enrich images and videos for computer vision use cases.