End-to-end workflow with model distillation for computer vision

Model distillation, also known as knowledge distillation, is a technique that focuses on creating efficient models by transferring knowledge from large, complex models to smaller, deployable ones. 

Model distillation is one of many techniques that have grown in popularity and importance in enabling small teams to leverage foundation models in developing small (but mighty) custom models, used in intelligent applications.

In “A Pragmatic Introduction to Model Distillation for AI Developers”, we illustrated some of the conceptual foundations for how model distillation works. 

We described how distillation can be leveraged in any domain or data modality that requires efficiency and model optimization, whether the use case is computer vision or NLP.

In this tutorial we’ll demonstrate an end-to-end workflow for computer vision, using model distillation to fine-tune a YOLO model with labels created in Model Foundry using Amazon Rekognition. 

We’ll show how easy it is to go from raw data to cutting-edge models, customized to your use case, using a fashion products dataset (additional public datasets can be found here or on sites like Kaggle). 

In less than 30 minutes you’ll learn how to:

  • Ingest, explore and prepare the fashion products dataset;
  • Pick and configure any image-based foundation model to automatically label data with just a few clicks;
  • Export the labeled predictions to a project as a potential set-up for manual evaluation;
  • Use the labeled predictions dataset to fine-tune a student model using a cloud-based notebook provider;
  • Evaluate performance of the fine-tuned student model.

At the end of the tutorial we’ll also discuss advanced considerations in scaling your models up and out, such as automating data ingestion and labeling, and resources for incorporating RLHF into your workflows.

See it in action: How to use model distillation to fine-tune a small, task-specific computer vision model

The walkthrough below covers Labelbox’s platform across Catalog, Annotate, and Model. We recommend that you create a free Labelbox account to best follow along with this tutorial. You’ll also need to create API keys for accessing the SDK.
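Once you’ve generated a key, connecting through the Python SDK takes only a couple of lines. Here’s a minimal sketch, assuming the labelbox package is installed; the API key value is a placeholder:

```python
# Minimal sketch: authenticate the Labelbox Python SDK.
# Install first with: pip install labelbox
import labelbox as lb

LB_API_KEY = "<YOUR_API_KEY>"  # placeholder: paste the key you created above
client = lb.Client(api_key=LB_API_KEY)
```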

Notebook: CV YOLO Model Distillation


Overview

The Model Distillation Workflow for Computer Vision

In our prior post on model distillation concepts, we discussed the different model distillation patterns, based on the following criteria:

  • The type of Teacher model architecture used;
  • The type of Student model architecture used;
  • The type of knowledge being transferred from teacher to student model(s) (response-based, feature-based, relation-based knowledge);
  • The cadence and scheme for how the student model is trained (offline, online, self).

In this tutorial we’ll be demonstrating the most popular and easiest pattern to get started with: offline, response-based model (or knowledge) distillation.

The teacher model we’ll be using to produce the responses is Amazon Rekognition, and the student model is YOLOv8 Object Detection.

As you’ll see, we could have chosen any combination of teacher or student models, because the offline, response-based pattern of model distillation is incredibly flexible. 

When implementing this process for your own use case, it’s important to understand the relative strengths and weaknesses of each model and match them according to your requirements.

Using The Labelbox Platform To Automate Model Distillation 

Labelbox is the leading data-centric AI platform, providing end-to-end tooling for curating, transforming, annotating, evaluating, and orchestrating unstructured data for data science, machine learning, and generative AI.

The Labelbox platform supports the development of intelligent applications using the model distillation and fine-tuning workflow.

AI developers are enabled to easily:

  • Ingest, explore and prepare image-based datasets using Catalog;
  • Automatically label data with foundation models using Model Foundry;
  • Review and enrich labels with human-in-the-loop workflows in Annotate;
  • Fine-tune and evaluate custom models using Labelbox Model and the SDK.

Introduction To Data Preparation for Computer Vision With Catalog

Before beginning the tutorial, make sure the fashion products dataset has been uploaded to Labelbox Catalog (you can ingest it through the UI or the SDK).

Once you’re able to see your dataset in Labelbox Catalog, you’ll be able to explore, filter, and curate the images you want to label.

For additional details on how to use Catalog to enable data selection for downstream data-centric workflows (such as data labeling, model training, model evaluation, error analysis, and active learning), check out our documentation.
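If you prefer to script the ingestion step, here’s a hedged sketch of what that can look like with the SDK; the dataset name, image URL, and global key are placeholders:

```python
# Sketch: create a Catalog dataset and ingest image rows via the SDK.
# URLs and keys below are placeholders for your own data.
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")
dataset = client.create_dataset(name="fashion-products")  # placeholder name

task = dataset.create_data_rows([
    {
        "row_data": "https://example.com/images/jacket_001.jpg",  # placeholder URL
        "global_key": "jacket_001",  # placeholder key for later lookups
    },
])
task.wait_till_done()
print(task.errors)  # None on success
```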


Using A Large Computer Vision Model To Generate And Distill Predictions

The first step of model distillation is to identify an appropriate teacher model, which will be used to produce responses that, when combined with the original images, will serve as the fine-tuning dataset for the student model.

Response-based model distillation is powerful because it can be used even when access to the original model weights is limited (or the model is so large that downloading a copy would be impractical). Response-based distillation also doesn’t require the user to have trained the teacher model themselves, only that the model was pre-trained.

Labelbox allows you to pick any of the currently hosted, state-of-the-art models (or upload your own custom models) to use as the teacher model.

For now, let’s get started with preparing the images we’ll be labeling (that is, generating predictions for) using Amazon Rekognition. The resulting image-label pairs will be used to fine-tune YOLO.

Step 1: Select images and choose a foundation model of interest 

Steps:

  • Navigate to the fashion products dataset in Catalog.
  • To home in on a subset of data, leverage Catalog’s filters, including media attributes, natural language search, and more, to refine the images on which predictions should be made.
  • Once you’ve surfaced data of interest, click “Predict with Model Foundry”.
  • You will then be prompted to choose a model that you wish to use in the model run (in this case Rekognition).
  • Select a model from the ‘model gallery’ based on the type of task, such as image classification, object detection, or image captioning.

Step 2: Configure model settings and submit a model run

When developing ML based applications, developers need to quickly and iteratively prepare and version training data, launch model experiments, and use the performance metrics to further refine the input data sources. 

The performance of a model can vary wildly depending on the data used, the quality of the annotations, and even the model architecture itself. A necessary requirement for replicability is being able to see the exact version of all the artifacts used or generated as a result of an experiment. 

Labelbox will snapshot the experiment, the data artifacts as well as the trained model, as a saved process known as a model run

This includes the types of items the model is supposed to identify and label, known as an ontology.   

Each model has an ontology defined to describe what it should predict from the data; the available configuration options vary depending on the selected model and your scenario.

For example, you can edit a model ontology to ignore specific features or map the model ontology to features in your own (pre-existing) ontology.
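For reference, the same single-class bounding-box ontology could also be defined programmatically. The sketch below uses the SDK’s OntologyBuilder with a placeholder ontology name:

```python
# Sketch: define a single-class bounding-box ontology ("Jacket") via the SDK.
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")

ontology_builder = lb.OntologyBuilder(
    tools=[lb.Tool(tool=lb.Tool.Type.BBOX, name="Jacket")]
)
ontology = client.create_ontology(
    "fashion-distillation-ontology",  # placeholder name
    ontology_builder.asdict(),
    media_type=lb.MediaType.Image,
)
```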

Each model also has its own set of settings, which you can find under Advanced model settings.

Steps:

  • Once you’ve located Rekognition, you can click on the model to view and set the model and ontology settings or prompt. 
  • In this case, we set the ontology to detect “Jacket”; a preview shows the model’s output on this ontology.
  • To get an idea of how your current model settings affect the final predictions, you can generate preview predictions on up to five data rows.

While this step is optional, generating preview predictions allows you to confidently confirm your configuration settings:

  • If you’re unhappy with the generated preview predictions, you can make edits to the model settings and continue to generate preview predictions until you’re satisfied with the results. 
  • Once you’re satisfied with the predictions, you can submit your model run.

Step 3: Review predictions in the Model tab 

Because each model run is submitted with a unique name, it’s easy to distinguish between subsequent model runs.

When the model run completes, you can open the run in the Model tab to review the generated predictions.

Steps:

  • We can see that for the most part, “jackets” were correctly identified and labeled.
  • These generated labels are now ready to be used for fine-tuning the student model YOLO.

Step 4: Enriching and evaluating predictions using human-in-the-loop and Annotate 

Although fine-tuning a foundation model requires less data than pre-training a large foundation model from scratch, the data (specifically the labels) need to be high-quality. 

Even big, powerful foundation models make mistakes or miss edge cases. 

You might also find additional objects that the parent model didn’t identify because those items weren’t initially marked as important in the ontology (for example, “boots” or “ski hats”).

Once a parent model like Rekognition has been used for the initial model-assisted labeling run, those predictions can then be sent to a Labelbox project, a container where all your labeling processes happen.

Steps:

  • In this case, we feel fairly confident in how well Rekognition performed, so we’ll send the inferences to the corresponding Labelbox project and treat them as the ground truth on which the student model will be fine-tuned.

Fine-Tuning The Student Model (YOLO) 

We’ve now completed the first half of the model distillation and fine-tuning workflow:

  • We identified the items we wanted the parent model (Rekognition) to detect and label in the form of an ontology. 
  • We used Rekognition to automatically label items like “jackets”.
  • We exported the generated labels to a Labelbox project, at which point we could review the labels manually and enrich them further using the Labelbox editor.

The next step is to use the generated labels, along with the original image, to fine-tune a student model in Colab. 

Note: You’ll now need the API keys from earlier to follow along with the Colab notebook.

Step 5: Fetch the ground truth labels from the project via the Labelbox SDK and convert the images into the relevant format

For brevity, we’ve omitted the relevant code samples but you can copy or run the corresponding blocks in the provided notebook. 
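To give a sense of what those blocks do, here’s a hedged sketch of the export-and-convert step. The field names follow Labelbox’s export schema at the time of writing and the project ID is a placeholder; defer to the notebook for the exact code:

```python
# Sketch: export ground-truth bounding boxes from a project and write
# YOLO-format label files (class cx cy w h, normalized to image size).
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")
PROJECT_ID = "<PROJECT_ID>"  # placeholder
project = client.get_project(PROJECT_ID)

export_task = project.export_v2(params={"data_row_details": True, "label_details": True})
export_task.wait_till_done()

CLASS_IDS = {"Jacket": 0}  # the single-class ontology from this tutorial

for row in export_task.result:
    img_w = row["media_attributes"]["width"]
    img_h = row["media_attributes"]["height"]
    lines = []
    for label in row["projects"][PROJECT_ID]["labels"]:
        for obj in label["annotations"]["objects"]:
            box = obj["bounding_box"]  # pixel top/left/height/width
            cx = (box["left"] + box["width"] / 2) / img_w
            cy = (box["top"] + box["height"] / 2) / img_h
            lines.append(f'{CLASS_IDS[obj["name"]]} {cx:.6f} {cy:.6f} '
                         f'{box["width"] / img_w:.6f} {box["height"] / img_h:.6f}')
    with open(f'{row["data_row"]["global_key"]}.txt', "w") as f:
        f.write("\n".join(lines))
```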

Check out our documentation to find out all the ways you can automate the model lifecycle (including labeling) using our SDK. 

Step 6: Fine-tune student YOLO model using labels generated by Amazon Rekognition

See the example notebook for omitted code.
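As a rough sketch of what the fine-tuning block does, here’s the Ultralytics API applied to the exported dataset; fashion.yaml is a hypothetical dataset config pointing at the images and the label files from Step 5:

```python
# Sketch: fine-tune a pretrained YOLOv8 model on the Rekognition-derived labels.
# "fashion.yaml" is a hypothetical dataset config, e.g.:
#   path: datasets/fashion
#   train: images/train
#   val: images/val
#   names: {0: Jacket}
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # start from pretrained nano weights
model.train(data="fashion.yaml", epochs=25, imgsz=640)
metrics = model.val()               # mAP and friends on the validation split
```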

Step 7: Create a model run with predictions and ground truth 

Oftentimes the initial training or fine-tuning step isn’t the final stop on the journey of developing a model. 

One of the biggest differences between training models in the classroom and training them in the real world is how much control you have over the quality of your data, and consequently the quality of the resulting model.

As we mentioned earlier, developers can upload predictions and use the Model product to diagnose performance issues with models and compare them across multiple experiments.

Doing so automatically populates model metrics that make it easy to evaluate the model’s performance.
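A hedged sketch of that upload step with the SDK follows; the model name, ontology ID, and global keys are placeholders, and `predictions` stands in for the annotation payload built from YOLO’s outputs:

```python
# Sketch: create a model run, attach data rows, and upload predictions so
# Labelbox Model can compute metrics against the ground truth.
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")

lb_model = client.create_model(name="yolo-distilled", ontology_id="<ONTOLOGY_ID>")
model_run = lb_model.create_model_run("v1")

# Attach the data rows that carry the ground-truth labels (placeholder keys)
model_run.upsert_data_rows(global_keys=["jacket_001", "jacket_002"])

# `predictions` would be a list of annotation payloads built from YOLO outputs
predictions = []  # placeholder
model_run.add_predictions(name="yolo-v1-predictions", predictions=predictions)
```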


Evaluating Model Performance

There’s no single metric to rule them all when evaluating computer vision model performance.

That said, Model offers a number of the most common metrics out of the box. With the ‘Metrics view’, users can drill into crucial model metrics, such as confusion matrix, precision, recall, F1 score, false positives, and more, to surface model errors.

Model metrics are auto-populated and interactive, which means you can click on any chart or metric to immediately open up the gallery view of the model run and see corresponding examples, as well as visually compare model predictions between multiple model runs. 
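Under the hood, detection metrics like precision and recall start from matching predicted boxes to ground truth by IoU. As a quick illustration (our own sketch, not Labelbox’s internals):

```python
# Illustrative IoU between two boxes in (x1, y1, x2, y2) form; detection
# precision/recall counts a prediction correct when IoU clears a threshold.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ≈ 0.143
```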

Step 8: Evaluate predictions from different YOLO model runs in Labelbox Model

Steps:

  • Navigate to “Model”.
  • By fine-tuning the YOLOv8 model with approximately 1,000 images annotated using Amazon Rekognition, we can achieve performance similar to the Rekognition model within roughly one hour.
  • We can now manually inspect examples of predictions from the fine-tuned YOLO model:
      ◦ Beautiful tailoring, clearly a gorgeous jacket.
      ◦ Dark and grungy, clearly a jacket with attitude and presence.
      ◦ In one example, the fine-tuned YOLO model detected one more instance of a jacket than the Amazon Rekognition model.
      ◦ In another example, the fine-tuned YOLO model again detected an additional jacket; the Amazon Rekognition model doesn’t appear to do as well at detecting furry jackets.

Advanced Considerations

In this step-by-step walkthrough, we’ve shown how anyone with an image-based dataset can leverage an image-based foundation model to label data and then fine-tune and analyze a small (but mighty) custom model.

Additional considerations users should address for scaling similar projects include:

  • Automating future data ingestion, curation, enrichment, and labeling when the fine-tuned model needs to be retrained due to drift (see the sketch after this list).
  • Incorporating human evaluators and human-in-the-loop, as well as error analysis for identifying and addressing edge cases. 
  • An easy-to-use interface that can be customized for various modalities of data when multiple users are involved.
  • A robust SDK for integrating with any MLOps solutions provider, especially when incorporating model monitoring and complex deployment patterns. 
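For instance, the re-labeling half of that retraining loop can be scripted; here’s a hypothetical sketch with placeholder IDs and keys:

```python
# Hypothetical sketch: queue newly ingested data rows for labeling so the
# distillation loop can be re-run when the deployed model drifts.
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")
project = client.get_project("<PROJECT_ID>")  # placeholder

project.create_batch(
    name="retraining-batch-01",                # placeholder batch name
    global_keys=["jacket_101", "jacket_102"],  # placeholder new data rows
    priority=1,
)
```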

Even with fine-tuned models, there’s no such thing as “set it and forget it”.

All models eventually need to be retrained, with the data refreshed to account for changes.


Conclusion

In this tutorial, we demonstrated an end-to-end workflow for computer vision, using model distillation to fine-tune a YOLO model with labels created in Model Foundry using Amazon Rekognition. 

Hopefully you were able to see how easy it is to go from raw data to cutting-edge custom models in less than 30 minutes.

You learned how the Labelbox platform enables model distillation by allowing developers to: 

  • Ingest, explore and prepare image-based datasets using Catalog;
  • Use any image-based foundation model to automatically label data using Model Foundry, and incorporate human-in-the-loop evaluation using Annotate;
  • Export these labeled predictions to a cloud-based training environment for fine-tuning;
  • Automate the various workflows using the Labelbox SDK;
  • Evaluate model performance and analyze model errors using Labelbox Model. 

In the next part of this series, we’ll replicate a very similar workflow for NLP, using model distillation to fine-tune a BERT model with labels created in Model Foundry using PaLM2.