
A list of foundation models to get the most out of your data


In years past, building and deploying a successful AI application required extensive engineering. Teams spent enormous amounts of time and effort collecting, cleaning, and labeling data, and each iteration often took weeks.


With the advent of powerful, off-the-shelf foundation models, AI builders have entered a new era of work. Rather than developing custom models from scratch, teams can now fine-tune an existing foundation model for their use case and business requirements.


In this article, we'll explore a list of foundation models available to AI builders, along with their intended use cases, limitations, and possible real-world applications. With Labelbox Foundry, you'll be able to explore, experiment with, and leverage the foundation models detailed below to pre-label data and accelerate your downstream machine learning workflow.


A list of computer vision foundation models

Computer vision foundation models excel at a variety of tasks, including image classification, object detection, and more. The following list includes some of the most popular models that can help streamline AI development.


Vision Transformer (ViT) 

The Vision Transformer (ViT) model was first proposed in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." By splitting an image into fixed-size patches and processing them with a standard Transformer encoder, this foundation model captures long-range dependencies in images, making it particularly effective in scenarios where context and global information are crucial. However, it may require significant amounts of training data and computational resources to achieve optimal performance.
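
As a rough sketch of how ViT can be used for image classification, the snippet below loads a pretrained checkpoint through the Hugging Face transformers library; the checkpoint name and image path are illustrative assumptions, not anything specific to Labelbox Foundry.

    # Minimal ViT image-classification sketch (pip install transformers torch pillow).
    # "photo.jpg" is a placeholder image path.
    from PIL import Image
    import torch
    from transformers import ViTImageProcessor, ViTForImageClassification

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
    model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

    image = Image.open("photo.jpg")            # placeholder path
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits        # one logit per ImageNet-1k class

    predicted_class = logits.argmax(-1).item()
    print(model.config.id2label[predicted_class])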


YOLOv8

Ultralytics' YOLOv8 (You Only Look Once, version 8) is a state-of-the-art object detection foundation model designed to accurately identify objects within images or video frames.


Through its real-time object detection capabilities and versatility, YOLOv8 offers a robust solution for applications that demand accurate and rapid identification of objects. However, its performance can falter when detecting small or densely packed objects, and the model may also have difficulty with objects that have low contrast.
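
For a sense of how little code inference takes, here is a minimal sketch using the ultralytics Python package; the checkpoint name and image path are placeholders.

    # YOLOv8 inference sketch (pip install ultralytics).
    # "yolov8n.pt" is the small "nano" checkpoint; "street.jpg" is a placeholder path.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")                 # downloads pretrained weights on first use
    results = model("street.jpg")              # run detection on one image

    for box in results[0].boxes:
        class_id = int(box.cls)
        confidence = float(box.conf)
        print(model.names[class_id], confidence, box.xyxy.tolist())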


MobileNetV2

Google's MobileNetV2 is a highly efficient convolutional neural network (CNN) designed for mobile and embedded devices. This foundation model is well-suited for a wide range of computer vision tasks, including image classification, object detection, and semantic segmentation. However, its efficiency can come at the cost of reduced accuracy in complex scenarios where fine details are important.
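
Below is a minimal classification sketch using the pretrained MobileNetV2 weights shipped with torchvision; the image path is a placeholder, and the weights enum assumes a recent torchvision release.

    # MobileNetV2 classification sketch (pip install torch torchvision pillow).
    # Requires torchvision >= 0.13 for the weights enum; "photo.jpg" is a placeholder path.
    import torch
    from PIL import Image
    from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

    weights = MobileNet_V2_Weights.DEFAULT
    model = mobilenet_v2(weights=weights).eval()
    preprocess = weights.transforms()          # preprocessing that matches these weights

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
    with torch.no_grad():
        probs = model(image).softmax(dim=-1)

    top = probs.argmax(-1).item()
    print(weights.meta["categories"][top], probs[0, top].item())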


EfficientNet

EfficientNet-B5 is a CNN for image classification from the EfficientNet family, which scales network depth, width, and input resolution in a balanced way. This foundation model's blend of accuracy and efficiency makes it an attractive choice for tasks that demand high-quality image analysis.


However, the larger EfficientNet variants have substantial computational requirements and might not be a suitable choice for tasks requiring real-time processing on low-power devices.
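
As an illustrative sketch, a pretrained EfficientNet-B5 checkpoint can be loaded through the timm library; the image path is a placeholder.

    # EfficientNet-B5 classification sketch via timm (pip install timm torch pillow).
    # "photo.jpg" is a placeholder path.
    import timm
    import torch
    from PIL import Image

    model = timm.create_model("efficientnet_b5", pretrained=True).eval()
    config = timm.data.resolve_data_config({}, model=model)
    transform = timm.data.create_transform(**config)   # resize/crop/normalize to match the checkpoint

    image = transform(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = model(image).softmax(dim=-1)

    print("top-1 class index:", probs.argmax(-1).item())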


OWL-ViT

OWL-ViT (Vision Transformer for Open-World Localization) is an open-vocabulary object detection network trained on a wide variety of (image, text) pairs. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to extract visual features and a causal language model to extract text features. By introducing context-rich positional embeddings, it achieves competitive results on object detection tasks.


However, OWL-ViT may not perform as well on large datasets, and its results can be sensitive to the specifics of the task. Further fine-tuning might be needed to adapt it to a specific task for optimal results.
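
The sketch below shows open-vocabulary detection with OWL-ViT through Hugging Face transformers, where the classes to detect are supplied as free-form text queries; the queries and image path are illustrative assumptions.

    # OWL-ViT zero-shot detection sketch (pip install transformers torch pillow).
    # The text queries and "street.jpg" are placeholders.
    import torch
    from PIL import Image
    from transformers import OwlViTProcessor, OwlViTForObjectDetection

    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

    image = Image.open("street.jpg")
    queries = [["a pedestrian", "a traffic light", "a bicycle"]]   # free-form text classes

    inputs = processor(text=queries, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    target_sizes = torch.tensor([image.size[::-1]])                # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=target_sizes
    )[0]

    for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
        print(queries[0][int(label)], round(score.item(), 3), box.tolist())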


BLIP-2

BLIP-2 is an effective and efficient foundation model developed by Salesforce Research that can be applied to image understanding in numerous scenarios, especially when examples may be scarce.


This model may not have up-to-date information about new image content and inherits the risks associated with language models, such as outputting offensive language or propagating social bias.
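
As a minimal sketch of image understanding with BLIP-2, the snippet below asks a question about an image using the Hugging Face transformers integration; the checkpoint is large (roughly 3B parameters), and the image path and question are placeholders.

    # BLIP-2 visual question answering sketch (pip install transformers torch pillow).
    # "product.jpg" and the question text are placeholders.
    import torch
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

    image = Image.open("product.jpg")
    prompt = "Question: what is shown in this image? Answer:"

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=30)

    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())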


Amazon Rekognition

Amazon Rekognition's object detection model is primarily used for detecting objects, scenes, activities, landmarks, faces, dominant colors, and image quality in images and videos. The model is reported to have high accuracy in detecting objects and scenes in images and videos. However, its accuracy can degrade as the complexity of the image content increases.
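
A minimal label detection call through boto3 might look like the sketch below; it assumes AWS credentials are already configured, and the bucket and object names are placeholders.

    # Amazon Rekognition label detection sketch (pip install boto3).
    # The bucket, key, and region are placeholders; AWS credentials must be configured.
    import boto3

    rekognition = boto3.client("rekognition", region_name="us-east-1")

    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": "my-example-bucket", "Name": "images/photo.jpg"}},
        MaxLabels=10,
        MinConfidence=75,
    )

    for label in response["Labels"]:
        print(label["Name"], round(label["Confidence"], 1))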


A list of large language models (LLMs)

Large language models (LLMs) have revolutionized the field of AI and natural language processing (NLP). The following are a few examples of LLM foundation models that can help accelerate model development.


GPT-4

GPT-4 is optimized for conversational interfaces and can be used to generate text summaries, reports, and responses. As a multimodal model, GPT-4 can accept both text and image inputs; however, it still retains the downsides of a traditional LLM, such as being prone to hallucination and reasoning errors.
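
A minimal chat completion request with the openai Python SDK (v1.x) might look like the sketch below; the prompt is a placeholder and the API key is assumed to be set in the environment.

    # GPT-4 chat completion sketch (pip install openai).
    # Assumes the OPENAI_API_KEY environment variable is set; the prompt is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Summarize this support ticket in two sentences: ..."},
        ],
    )

    print(response.choices[0].message.content)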


PaLM 2

PaLM 2 is built for language understanding, language generation, and conversations. This foundation model is fine-tuned to conduct natural multi-turn conversations and excels at tasks such as advanced reasoning, translation, and code generation. However, this model's main limitations are its potential for toxicity and bias: researchers found that of the times PaLM 2 responded to prompts incorrectly, 38.2% of those responses “reinforced a harmful social bias.”
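
As a rough sketch, a PaLM 2 text model can be called through the Vertex AI SDK; the project ID is a placeholder, and the "text-bison@001" model name reflects the Vertex AI endpoint available at the time of writing.

    # PaLM 2 text generation sketch via Vertex AI (pip install google-cloud-aiplatform).
    # "my-gcp-project" is a placeholder; GCP credentials must be configured.
    import vertexai
    from vertexai.language_models import TextGenerationModel

    vertexai.init(project="my-gcp-project", location="us-central1")

    model = TextGenerationModel.from_pretrained("text-bison@001")
    response = model.predict(
        "Translate to French: 'The shipment arrives on Tuesday.'",
        temperature=0.2,
        max_output_tokens=128,
    )
    print(response.text)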


BLOOM

BLOOM is the world's largest open multilingual language model, with 176 billion parameters, and can be used for text generation, multilingual tasks, and language understanding applications. However, the model's performance depends heavily on the prompt, and it will not perform well if the input is unclear.
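
A minimal generation sketch with a BLOOM checkpoint via Hugging Face transformers is shown below; since the full 176B-parameter model requires multi-GPU hardware, the example assumes the much smaller bloom-560m variant.

    # BLOOM text generation sketch (pip install transformers torch).
    # Uses the small bloom-560m variant so it runs on a single machine.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
    model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

    prompt = "La capitale de la France est"     # BLOOM is multilingual, so a French prompt works too
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)

    print(tokenizer.decode(outputs[0], skip_special_tokens=True))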


Claude Instant

Anthropic’s Claude Instant is a low-latency, high-throughput LLM trained to perform various conversational and text-processing tasks. This foundation model can answer questions, give recommendations, and have casual conversations using natural language. However, it does not possess the general capability for open-domain problem solving.
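
A minimal request with the anthropic Python SDK might look like the sketch below; the model identifier and prompt are assumptions, and the API key is expected in the environment.

    # Claude Instant sketch using the anthropic SDK (pip install anthropic).
    # Assumes ANTHROPIC_API_KEY is set; "claude-instant-1.2" reflects the identifier in use at the time of writing.
    import anthropic

    client = anthropic.Anthropic()

    message = client.messages.create(
        model="claude-instant-1.2",
        max_tokens=256,
        messages=[{"role": "user", "content": "Recommend three books on data labeling workflows."}],
    )

    print(message.content[0].text)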


T-5

T-5 (Text-to-Text Transfer Transformer) is Google AI's versatile language model, designed to reframe every NLP task in a unified text-to-text format in which both inputs and outputs are text strings. Some limitations of T-5 include its 512-token input limit and variable performance depending on the specific task.
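
The text-to-text interface is easy to see in a minimal sketch with Hugging Face transformers and the small t5-small checkpoint; the task prefix and input sentence are placeholders.

    # T-5 text-to-text sketch (pip install transformers sentencepiece torch).
    # Task prefixes such as "summarize:" or "translate English to German:" select the behavior.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    text = "translate English to German: The labels were reviewed and approved yesterday."
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)

    print(tokenizer.decode(outputs[0], skip_special_tokens=True))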


Cohere Command

Command is Cohere's text generation model and can be used for interactive autocomplete, augmenting human writing processes, summarization, text rephrasing, and other text-to-text tasks in non-sensitive domains. As an LLM, however, this foundation model is prone to inherent problems such as hallucination and bias, and it is limited to the knowledge it was trained on.
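
A minimal generation call with the cohere Python SDK might look like the sketch below; the API key and prompt are placeholders.

    # Cohere Command generation sketch (pip install cohere).
    # "YOUR_COHERE_API_KEY" and the prompt are placeholders.
    import cohere

    co = cohere.Client("YOUR_COHERE_API_KEY")

    response = co.generate(
        model="command",
        prompt="Rewrite this sentence to sound more formal: we got the data labeled real quick.",
        max_tokens=60,
    )

    print(response.generations[0].text.strip())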


Final thoughts on using foundation models to get the most out of your data

Foundation models have changed the way AI builders and business users build AI applications. Using these cutting-edge foundation models, AI builders can accelerate the development of common enterprise AI applications at scale across a variety of industries.


Learn more about how you can quickly and easily leverage the foundation models above for AI development with Labelbox Foundry. Explore and experiment with a variety of foundation models, evaluate model performance on your data, and leverage the best one for your use case.