Labelbox · August 24, 2023

6 cutting-edge foundation models for computer vision and how to use them

Just two years ago, building and deploying a successful AI application involved extensive engineering. Teams often had to build their own data and AI architecture, spent exorbitant amounts of time and effort collecting, cleaning, and labeling data, and waited weeks for each iteration.

With the advent of powerful, off-the-shelf foundation models, AI builders entered a new era of work. Rather than developing custom models from scratch, builders can now fine-tune a powerful, off-the-shelf model for their use case and business requirements. Builders of computer vision applications now have a plethora of foundation models to choose from, each with different strengths and uses. Finding the right model to start with can accelerate AI development, while making a suboptimal choice can result in lower performance and create delays.

In this post, we'll explore six of the most powerful foundation models available to AI builders, the use cases and applications they are best suited for, and how you can explore, test, and leverage them quickly and easily for pre-labeling data and AI development with Model Foundry.

Vision Transformer (ViT)

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale — the first paper that successfully trained a transformer encoder on ImageNet, attaining strong results compared to familiar convolutional architectures.

ViT signifies a revolutionary transition in the field of computer vision, as it embraces the transformer architecture initially developed for tasks in natural language processing. ViT’s core concept involves splitting an image into patches of a fixed, predetermined size and processing those patches as a sequence, much like words in a sentence. This makes it particularly strong for tasks such as image classification or object detection.
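To make the patching step concrete, here is a minimal sketch in plain Python. It only illustrates how an image is cut into non-overlapping, flattened patches; the function name `split_into_patches` and the toy image are made up for this example, and a real ViT additionally projects each patch into an embedding and adds position information.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch-sized window at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 single-channel "image" split into 2 x 2 patches.
toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_patches = split_into_patches(toy_image, patch_size=2)
print(toy_patches[0])    # top-left patch: [0, 1, 4, 5]
print(len(toy_patches))  # 4
```

With the 16x16 patches from the paper's title, a 224x224 input would yield 14 x 14 = 196 patches.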

ViT has shown remarkable performance across various benchmark datasets, and often outperforms traditional convolutional neural networks (CNNs) in terms of accuracy. Its strength lies in large-scale image recognition tasks, and it generalizes effectively, particularly when it has been pre-trained on large datasets before being fine-tuned on smaller ones.

While ViT’s strength lies in capturing global context, it might not perform as well as CNNs on finer-grained details. In addition, ViT has high computational and memory requirements, making it less efficient than CNNs when handling large volumes of visual data.
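A rough way to see where the compute and memory cost comes from: self-attention compares every token with every other token, so the attention matrix grows with the square of the number of patches. The sketch below is back-of-the-envelope arithmetic only. It assumes 16x16 patches plus one class token, as in the original ViT paper, and ignores the number of heads and layers; `attention_matrix_entries` is a hypothetical helper name.

```python
def attention_matrix_entries(image_size, patch_size=16):
    """Entries in one self-attention matrix for a square image (one head, one layer)."""
    num_patches = (image_size // patch_size) ** 2
    num_tokens = num_patches + 1  # patches plus one class token
    return num_tokens * num_tokens


for size in (224, 448):
    print(size, attention_matrix_entries(size))
# 224 38809
# 448 616225
```

Doubling the image side length quadruples the number of patches and multiplies the attention matrix size by roughly sixteen.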

Overall, the Vision Transformer model's capacity to capture global context in images opens doors to innovation in various industries, offering solutions that were previously challenging to achieve using conventional methods.

Sample use case:

In agriculture, ViT can be employed for crop disease detection. By analyzing images of crops, ViT can help identify signs of diseases, allowing farmers to take timely actions to protect their yields and optimize farming best practices.


YOLOv8

Ultralytics’ YOLOv8 (You Only Look Once version 8) is a state-of-the-art object detection model designed to accurately identify objects within images or video frames. It leverages a combination of deep learning techniques, including CNNs and anchor-free detection, to achieve real-time object detection capabilities.

YOLOv8 offers unparalleled performance in terms of speed and accuracy, making it suitable for a wide range of applications such as surveillance, self-driving cars, and robotics. The model’s architecture, featuring multiple detection heads, enhances its precision in detecting multiple objects simultaneously in images or videos.

However, YOLOv8’s performance can falter when it comes to detecting small or densely packed objects. Additionally, the model may have difficulty detecting objects with low contrast.
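One reason crowded scenes are hard for many detectors is the de-duplication step that usually follows prediction. YOLO-family pipelines typically apply non-maximum suppression (NMS), which drops boxes that overlap a higher-scoring box too much, so two genuinely distinct objects standing close together can be merged into one detection. The sketch below is a generic, simplified NMS in plain Python, not Ultralytics’ implementation; the boxes, scores, and thresholds are invented for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring boxes; drop any box overlapping a kept box too much.

    detections is a list of (box, score) tuples.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Two neighbouring items that overlap heavily, plus one isolated item.
detections = [
    ((0, 0, 10, 10), 0.9),
    ((2, 0, 12, 10), 0.8),
    ((30, 0, 40, 10), 0.7),
]
kept_default = non_max_suppression(detections)                     # middle box is dropped
kept_lenient = non_max_suppression(detections, iou_threshold=0.7)  # all three survive
print(len(kept_default), len(kept_lenient))  # 2 3
```

Raising the overlap threshold keeps more boxes in dense scenes, at the price of more duplicate detections elsewhere.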

Through its real-time object detection capabilities and versatility, YOLOv8 offers a robust solution for applications that might demand accurate and rapid identification of objects.

Sample use case:

In retail, YOLOv8 has the potential to power advanced shelf-monitoring systems. By analyzing camera feeds within stores, the model can help monitor shelf stock levels and detect misplaced or out-of-stock items. This can help stores maintain optimal inventory levels and improve customer experiences.


MobileNetV2

Google’s MobileNetV2 is a highly efficient CNN designed for mobile and embedded devices. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. MobileNetV2’s main goal is to balance model accuracy and computational efficiency, making it suitable for situations where resource limitations are a factor.

MobileNetV2 is well-suited for a wide range of computer vision tasks, including image classification, object detection, and semantic segmentation. MobileNetV2 maintains impressive accuracy on various benchmark datasets with significantly fewer parameters and computational resources compared to larger, more complex models.
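Much of the parameter saving in the MobileNet family comes from depthwise separable convolutions, which the MobileNet papers use in place of standard convolutions. The arithmetic below is a simplified sketch that compares weight counts for a single layer and ignores biases; the layer sizes are arbitrary example values, not taken from the actual MobileNetV2 architecture.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard k x k convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel, in_channels, out_channels):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels     # 1 x 1 conv that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
print(standard, separable)  # 18432 2336
```

For this example layer, the separable version needs roughly one eighth of the weights.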

While MobileNetV2’s strength lies in its efficiency, it may not achieve the same performance as larger and more computationally intensive models on certain tasks. This trade-off can lead to reduced accuracy in complex scenarios where fine details are important.

Suitable for a wide range of applications, the model’s efficiency and performance make it valuable in scenarios where resources are limited.

Sample use case:

In the media and entertainment industry, MobileNetV2 could be used to enhance augmented reality (AR) experiences on mobile devices. It can help accurately and efficiently detect objects in the real-world environment, allowing AR apps to integrate virtual elements with the environment.


EfficientNet-B5

EfficientNet-B5 is an efficient CNN for image classification tasks. The “B5” variant represents a specific scale of the EfficientNet architecture, balancing model depth, width, and resolution to optimize both accuracy and computational efficiency.
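The balancing of depth, width, and resolution follows the compound scaling rule from the EfficientNet paper, where a single coefficient φ scales all three dimensions together. The sketch below uses the base constants reported in the paper (α = 1.2, β = 1.1, γ = 1.15). It illustrates the rule only; the released B-variants use their own rounded multipliers, so this should not be read as the exact configuration of B5.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the paper


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Compute cost grows linearly with depth, quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in (1, 2, 3):
    depth, width, resolution, flops = compound_scale(phi)
    print(phi, round(depth, 2), round(width, 2), round(resolution, 2), round(flops, 2))
```

Because α·β²·γ² is close to 2, each step in φ roughly doubles the compute cost.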

EfficientNet-B5 exhibits high accuracy and strong performance across various image classification tasks. It can perform well while maintaining relatively lower computational demands compared to larger models. This balance allows the model to excel in scenarios that require accurate predictions while conserving resources.

It’s important to consider that EfficientNet-B5 might still have high computational requirements and might not be a suitable choice for tasks requiring real-time processing on low-power devices due to its complexity.

EfficientNet-B5’s blend of accuracy and efficiency makes it an attractive choice for tasks demanding high-quality image analysis. Its versatility can be applied to a variety of industries, catering to use cases where precise predictions and manageable computational demands are required.

Sample use case:

In the healthcare domain, EfficientNet-B5 can be applied to medical image analysis. For example, it could be used to assist in diagnosing diseases from x-rays or MRIs, aiding medical professionals in making timely and accurate assessments or diagnoses.


OWL-ViT

OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers. It is an open-vocabulary object detection network trained on a variety of (image, text) pairs. When presented with an image alongside one or more text queries, it identifies objects within the image that correspond with the queries. Unlike conventional object detection models, OWL-ViT is not restricted to a fixed set of labeled object categories. Instead, it harnesses multi-modal representations to perform detection.
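Conceptually, matching text queries to image regions comes down to comparing embeddings in a shared space. The sketch below is a toy illustration of that idea in plain Python, not OWL-ViT’s actual scoring code: the three-dimensional vectors, region names, threshold, and the helper `match_regions_to_queries` are all hypothetical, whereas the real model learns high-dimensional embeddings for both modalities.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_regions_to_queries(region_embeddings, query_embeddings, threshold=0.8):
    """Assign each region its best-matching query if the similarity clears threshold."""
    matches = []
    for region_id, region_vec in region_embeddings.items():
        best_query, best_score = None, threshold
        for query, query_vec in query_embeddings.items():
            score = cosine_similarity(region_vec, query_vec)
            if score >= best_score:
                best_query, best_score = query, score
        if best_query is not None:
            matches.append((region_id, best_query, round(best_score, 3)))
    return matches


# Made-up embeddings: two text queries and three candidate image regions.
queries = {"a cat": [1.0, 0.0, 0.0], "a dog": [0.0, 1.0, 0.0]}
regions = {
    "region_0": [0.9, 0.1, 0.0],
    "region_1": [0.1, 0.95, 0.0],
    "region_2": [0.0, 0.0, 1.0],  # resembles neither query
}
matches = match_regions_to_queries(regions, queries)
print([(region, query) for region, query, _ in matches])
# [('region_0', 'a cat'), ('region_1', 'a dog')]
```

Because the queries are free text, new categories can be added at inference time by supplying new query strings rather than retraining a fixed classification head.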

OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. OWL-ViT’s architecture enhancements contribute to improved accuracy and efficiency in computer vision tasks. By introducing context-rich positional embeddings, it achieves competitive results on object detection tasks.

However, OWL-ViT may not perform as well on very large datasets and can be influenced by the specifics of the task. Further training might be needed to tune it for specific tasks and achieve an optimal result.

Overall, OWL-ViT’s customized adaptations for visual tasks make it a valuable asset across various industries.

Sample use case:

In the media and entertainment industry, where vast amounts of visual content are generated and shared, OWL-ViT can be employed for content analysis and moderation. For example, it can be used for video summarization, scene recognition, and content moderation, enabling platforms to better categorize and manage vast amounts of visual content.


BLIP-2

BLIP-2 is a vision language model (VLM) developed by Salesforce Research. It enables language models to understand images while keeping their parameters frozen, making it more compute-efficient than previous multimodal pre-training methods.

BLIP-2 is an effective and efficient model that can be applied to image understanding in numerous scenarios, especially when examples may be scarce. It can also be used for conditional text generation given an image and an optional prompt, and it can generate captions or answer questions about images.

BLIP-2’s limitations include inaccurate knowledge inherited from the language model and a lack of up-to-date information about new image content. Additionally, BLIP-2 inherits the risks associated with language models, such as outputting offensive language or propagating social bias.
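To see why freezing parameters saves compute, it helps to look at how small the trained portion of such a system can be. BLIP-2 trains a lightweight bridging module between a frozen image encoder and a frozen language model. The sketch below uses hypothetical round numbers chosen purely for illustration, not the actual sizes of any BLIP-2 checkpoint, and the component names are placeholders.

```python
def trainable_fraction(components):
    """components is a list of (name, parameter_count, is_frozen) tuples."""
    total = sum(count for _, count, _ in components)
    trainable = sum(count for _, count, frozen in components if not frozen)
    return trainable / total


# Hypothetical sizes for illustration only.
components = [
    ("image_encoder", 1_000_000_000, True),    # frozen
    ("bridge_module", 100_000_000, False),     # the only part that is trained
    ("language_model", 3_000_000_000, True),   # frozen
]
fraction = trainable_fraction(components)
print(f"{fraction:.1%} of parameters receive gradient updates")  # 2.4%
```

Only the trainable share needs gradients and optimizer state, which is where most of the training-time savings come from.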

BLIP-2’s versatile image-to-text generation capabilities can be used across a variety of industries.

Sample use case:

In retail, BLIP-2 can be used for automatic product tagging and image indexing. Given an image of a product, BLIP-2 can generate relevant tags and descriptions for the product, making it easier for the business to categorize and search for products.

AI builders are using all six of these foundation models to power a variety of computer vision AI applications today. With Model Foundry, you’ll soon be able to explore and experiment with a variety of computer vision models, evaluate performance on your data, and leverage the best one for computer vision tasks. Be sure to join the waitlist for its beta release, coming soon!


References

BLIP-2: Scalable Multimodal Pre-training Method. (2023, March 17). Salesforce AI. https://blog.salesforceairesearch.com/blip-2/

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. ArXiv:2005.12872 [Cs]. https://arxiv.org/abs/2005.12872

Lee, C. P., Lim, K. M., Song, Y., & Alqahtani, A. (2023). Plant-CNN-ViT: Plant Classification with Ensemble of Convolutional Neural Networks and Vision Transformer. Plants, 12(14), 2642. https://doi.org/10.3390/plants12142642

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv:2010.11929 [Cs]. https://arxiv.org/abs/2010.11929

EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling. (n.d.). Google AI Blog. https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html

Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., & Houlsby, N. (2022). Simple Open-Vocabulary Object Detection with Vision Transformers. ArXiv:2205.06230 [Cs]. https://arxiv.org/abs/2205.06230

Paul, S., & Chen, P.-Y. (2021). Vision Transformers are Robust Learners. ArXiv:2105.07581 [Cs]. https://arxiv.org/abs/2105.07581

Sandler, M., & Howard, A. (2018, April 3). MobileNetV2: The Next Generation of On-Device Computer Vision Networks. Google AI Blog. https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2018.00474

Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ArXiv.org. https://arxiv.org/abs/1905.11946

Ultralytics | Revolutionizing the World of Vision AI. (n.d.). Ultralytics. https://ultralytics.com/yolov8