The rise of off-the-shelf and foundation models has enabled AI teams to fine-tune existing models and pre-label data faster and more accurately — in short, significantly accelerating AI development. However, using these models at scale for building AI can quickly become expensive. One way to mitigate these costs and reduce waste is to add a model comparison process to your workflow, ensuring that any model you choose to integrate for AI development is the best choice for your requirements.
A comprehensive model comparison process evaluates models on various metrics, such as performance, robustness, and business fit. The results help teams kickstart model development, shorten time to value, and get the best results for their specific use case in less time and at lower cost.
Embedding model comparison into the AI development workflow, however, comes with its own unique challenges, including:
In this blog post, we’ll explore how you can tackle these challenges for a computer vision use case with Model Foundry.
Selecting the most suitable off-the-shelf model is pivotal for ensuring accurate and reliable predictions tailored to your specific business use case, often leading to accelerated AI development. Because different models exhibit different performance characteristics, diligently comparing the models’ predictions on your data helps you distinguish which model excels in metrics such as accuracy, precision, and recall. This systematic approach to model evaluation and comparison also gives you a “store of record” to refer back to as you continue to improve model performance.
Choosing the best off-the-shelf model provides a quick and efficient pathway to production, ensuring that the model aligns well with the business objectives. This alignment is crucial for the model's immediate performance and sets the stage for future improvements and adaptability to evolving requirements. The most suitable model for your use case also reduces the time and money spent on a labeling project. For instance, when pre-labels generated by a high-performing model are sent for annotation, less editing is required, making the labeling project quicker and more cost-effective. For tasks like bounding box detection, a better model produces a higher Intersection over Union (IOU), meaning its predicted boxes overlap the true objects more closely, so the pre-labels are higher quality and need fewer corrections. Furthermore, using the best model to enrich your data makes your trove of data easier to query and search.
With Model Foundry, you can evaluate a range of models for computer vision tasks to select the best model to perform pre-labeling or data enrichment on your data.
Once you’ve located a specific model of interest, you can click into the model to view and set the model and ontology settings.
Each model has an ontology that describes what it should predict from the data. The available options depend on the selected model and your scenario. For example, you can edit a model ontology to ignore specific features or map the model ontology to features in your own (pre-existing) ontology.
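To make the idea of ignoring or mapping ontology features concrete, here is a minimal sketch in plain Python. It does not use the Model Foundry API; the `ONTOLOGY_MAP` dictionary, the `remap_predictions` helper, and the prediction format are all hypothetical and only illustrate the logic that the UI settings apply for you.

```python
# Hypothetical mapping from a model's class names to features in your own
# ontology. A value of None means "ignore this class".
ONTOLOGY_MAP = {
    "aeroplane": "airplane",
    "person": "person",
    "boat": "boat",
    "clock": None,
}


def remap_predictions(predictions, ontology_map):
    """Rename mapped classes and drop ignored or unmapped ones."""
    remapped = []
    for pred in predictions:
        target = ontology_map.get(pred["class"])
        if target is None:
            # Class is either explicitly ignored or absent from the mapping.
            continue
        remapped.append({**pred, "class": target})
    return remapped


# Illustrative predictions (made-up boxes, not real model output).
predictions = [
    {"class": "aeroplane", "bbox": (10, 20, 200, 120)},
    {"class": "clock", "bbox": (5, 5, 30, 30)},
    {"class": "dog", "bbox": (50, 60, 90, 110)},
]
remapped = remap_predictions(predictions, ONTOLOGY_MAP)
print(remapped)
```

In this sketch, classes missing from the mapping are dropped rather than passed through, which keeps unexpected labels out of your ontology.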
Each model will also have its own set of hyperparameters, which you can find in the Advanced model setting. To get an idea of how your current model settings affect the final predictions, you can generate preview predictions on up to five data rows.
While this step is optional, generating preview predictions allows you to confidently confirm your configuration settings. If you’re unhappy with the generated preview predictions, you can make edits to the model settings and continue to generate preview predictions until you’re satisfied with the results. Once you’re satisfied with the predictions, you can submit your model run.
Each model run is submitted with a unique name, allowing you to distinguish between each subsequent model run. When the model run completes, you can:
You can repeat steps 1-5 with a different model, on the same data and for the same desired machine learning task, to evaluate and compare model performance. By comparing the predictions and outputs from different models, you can assess and determine which one would be the most valuable in helping automate your data labeling tasks.
To create a model run with model predictions and ground truth, users currently have to use a script to import the predictions from Model Foundry and ground truth from a project into a new model run.
In the near future, this will be possible via the UI, and the script will be optional.
After running the notebook, you'll be able to visually compare model predictions between two models. Use the ‘Metrics view’ to drill into crucial model metrics, such as confusion matrix, precision, recall, F1 score, and more, to surface model errors.
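For readers who want to see how these metrics relate to one another, the sketch below computes precision, recall, and F1 from true positive, false positive, and false negative counts using the standard definitions. The `detection_metrics` function name and the example counts are illustrative assumptions; the Metrics view calculates these values for you.

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts.

    tp: predictions that match a ground-truth object
    fp: predictions with no matching ground-truth object
    fn: ground-truth objects the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = detection_metrics(tp=80, fp=20, fn=20)
print(metrics)
```

Note that recall depends directly on false negatives: every missed object lowers recall without affecting precision.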
Model metrics are auto-populated and interactive. You can click on any chart or metric to open up the gallery view of the model run and see corresponding examples.
Select the best performing model and leverage the model predictions as pre-labels. Rather than manually labeling data rows, select and send a subset of data to your labeling project with pre-labels to automate the process.
From the metrics overview, we see that Google Cloud Vision outperforms Microsoft Azure AI on recall, F1 score, intersection over union (IOU), and false negatives.
Microsoft Azure AI has a precision score of 0.8633, which beats Google Cloud Vision's 0.7948. On the other metrics, Microsoft Azure AI has an intersection over union score of 0.4034, an F1 score of 0.7805, and a recall of 0.3852, while Google Cloud Vision scores higher on all three, with an intersection over union of 0.4187, an F1 score of 0.7832, and a recall of 0.4149.
We can also see that the Microsoft Azure AI model has 12,665 more false negatives than Google Cloud Vision, and for our use case, we want the model with the fewest false negatives.
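As a quick way to tabulate a comparison like this, the sketch below takes the scores reported above and picks the higher-scoring model for each metric. The `REPORTED` dictionary and the `winners` helper are hypothetical names introduced for illustration, and the sketch assumes higher is better for every metric listed.

```python
# Scores as reported in the metrics overview above.
REPORTED = {
    "Microsoft Azure AI": {
        "precision": 0.8633, "iou": 0.4034, "f1": 0.7805, "recall": 0.3852,
    },
    "Google Cloud Vision": {
        "precision": 0.7948, "iou": 0.4187, "f1": 0.7832, "recall": 0.4149,
    },
}


def winners(scores):
    """Return the best-scoring model per metric (higher is better)."""
    metric_names = next(iter(scores.values())).keys()
    return {
        metric: max(scores, key=lambda model: scores[model][metric])
        for metric in metric_names
    }


best = winners(REPORTED)
for metric, model in best.items():
    print(f"{metric}: {model}")
```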
F1 scores for both Microsoft Azure AI and Google Cloud Vision Model are generally comparable, with a few instances showcasing superior performance by the Google Cloud Vision Model. Here are the specific results for each category:
Generally, the Google Cloud Vision Model exhibits superior performance in terms of intersection over union for classes such as train, boat, person, airplane, and bus. Intersection over union (IOU) is crucial because it measures how closely a predicted bounding box overlaps the ground-truth box: the area the two boxes share divided by the total area they cover together.
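For reference, IOU for two axis-aligned boxes can be computed as in the sketch below. It assumes boxes in `(x_min, y_min, x_max, y_max)` format, and the example coordinates are made up: a prediction that covers three quarters of the object and nothing else scores 0.75.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def intersection_over_union(box_a, box_b):
    """Shared area of the two boxes divided by the area they cover together."""
    # Corners of the overlapping rectangle, if any.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union else 0.0


ground_truth = (0, 0, 100, 40)  # hypothetical object box
prediction = (0, 0, 75, 40)     # covers the left three quarters of it
score = intersection_over_union(ground_truth, prediction)
print(score)
```

An IOU of 1.0 means the boxes coincide exactly, and 0.0 means they do not overlap at all.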
In summary, Google Cloud Vision Model exhibits superior recall values across the categories of train, boat, person, airplane, and bus.
Let’s now take a look at how the predictions appear on the images below:
Here we can see that Google Cloud Vision's blue bounding box properly captures the dimensions of the airplane, whereas Azure’s orange bounding box covers only about three quarters of it.
This is another example where Google Cloud Vision's blue bounding box has a better IOU than the Azure model’s orange bounding box. Based on the qualitative and quantitative analysis above, Google Cloud Vision is the superior model compared to Microsoft Azure AI.
From the metrics overview, Amazon Rekognition generally performs better than Google Cloud Vision on false negatives, true positives, recall, and IOU.
Amazon Rekognition outperformed Google Cloud Vision in the train, airplane, and bus categories.
Amazon Rekognition led in the bus, boat, and person categories.
Recall for Amazon Rekognition is better than Google Cloud Vision for boat, person, airplane, and bus.
Here, Google Cloud Vision (orange bounding box) has detected only one boat, while Amazon Rekognition has detected five more boats, a person, and a car.
Amazon Rekognition (blue bounding boxes) has detected more boats and a clock tower.
Amazon Rekognition's blue bounding box covers the truck fully, whereas the IOU for Google Cloud Vision is around 90%.
Amazon Rekognition (blue bounding boxes) also has a better IOU for the person than Google Cloud Vision, and it detects the cars in the background. Based on the analysis above, Amazon Rekognition is the best model for our use case.
Send the predictions as pre-labels to Labelbox Annotate for labeling
Since we've determined that Amazon Rekognition is the best model for our use case, we can send model predictions as pre-labels to our labeling project by highlighting all data rows and selecting "Add batch to project."
In conclusion, you can leverage Model Foundry not only to select the most appropriate model to accelerate data labeling, but also to automate data labeling workflows. Use quantitative and qualitative analysis, along with model metrics, to surface the strengths and limitations of each model and select the best performing model for your use case. Doing so can reveal detailed insights, as seen in the comparison above between Google Cloud Vision and Amazon Rekognition. In that example, Amazon Rekognition emerged as particularly well-suited to our project’s requirements, allowing us to rapidly automate data tasks for our use case.
Model Foundry streamlines the process of comparing model predictions, ensuring teams are leveraging the best model for data enrichment and automation tasks. With the right model, teams can easily create pre-labels in Labelbox Annotate – rather than starting from scratch, teams can boost labeling productivity by correcting the existing pre-labels.
Labelbox is a data-centric AI platform that empowers teams to iteratively build powerful product recommendation engines to fuel lasting customer relationships. To get started, sign up for a free Labelbox account or request a demo.