Grok
Grok is a general-purpose model that can be used for a variety of tasks, including generating and understanding text and code, and calling external functions.
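As an illustration of the function-calling use case, here is a minimal, hedged sketch that calls Grok through xAI's OpenAI-compatible API. The endpoint URL, model id, and the get_weather tool are assumptions; check them against xAI's current documentation before use.

```python
# Hypothetical sketch: calling Grok through xAI's OpenAI-compatible endpoint.
# Assumes the `openai` Python package, an XAI_API_KEY environment variable,
# and the "grok-2-latest" model name -- verify against xAI's docs.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1",  # assumed OpenAI-compatible endpoint
)

# Declare a tool so Grok can decide when to call it (function calling).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical tool
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="grok-2-latest",                          # assumed model id
    messages=[{"role": "user", "content": "What's the weather in Austin?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)       # inspect any tool call Grok requested
```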
Intended Use
Text and code: Generate code, extract data, prepare summaries and more.
Vision: Identify objects, analyze visuals, extract text from documents and more.
Function calling: Connect Grok to external tools and services for enriched interactions.
Performance
Limitations
Context: Grok may struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As Grok is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, Grok may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Grok can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Grok might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output of Grok are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://x.ai/blog/grok-2
Claude 3.5 Haiku
Claude 3.5 Haiku, the next generation of Anthropic's fastest and most cost-effective model, is optimal for use cases where speed and affordability matter. It improves on its predecessor across every skill set.
Intended Use
Claude 3.5 Haiku offers fast speeds, improved instruction-following, and accurate tool use, making it ideal for user-facing products and personalized experiences.
Key use cases include code completion, streamlining development workflows with quick, accurate code suggestions. It powers interactive chatbots for customer service, e-commerce, and education, handling high user interaction volumes. It excels at data extraction and labeling, processing large datasets in sectors like finance and healthcare. Additionally, it provides real-time content moderation for safe online environments.
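A minimal sketch of the data-extraction use case with Anthropic's Python SDK. The model id and the receipt text are assumptions; verify the current Haiku identifier in Anthropic's model list.

```python
# A minimal sketch of data extraction with the `anthropic` SDK.
# The model id "claude-3-5-haiku-20241022" is an assumption.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": "Extract the vendor name and total amount from this receipt text "
                   "and answer as JSON: 'ACME Corp ... Total due: $42.17'",
    }],
)
print(message.content[0].text)  # the extracted JSON string
```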
Performance
Limitations
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://docs.anthropic.com/claude/docs/models-overview
Llama 3.2
The Llama 3.2-Vision collection of multimodal large language models (LLMs) comprises pretrained and instruction-tuned image-reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open-source and closed multimodal models on common industry benchmarks.
Intended Use
Llama 3.2's 11B and 90B models support image reasoning, enabling tasks like understanding charts/graphs, captioning images, and pinpointing objects based on language descriptions. For example, the models can answer questions about sales trends from a graph or trail details from a map. They bridge vision and language by extracting image details, understanding the scene, and generating descriptive captions to tell the story, making them powerful for both visual and textual reasoning tasks.
Llama 3.2's 1B and 3B models support multilingual text generation and on-device applications with strong privacy. Developers can create personalized, agentic apps where data stays local, enabling tasks like summarizing messages, extracting action items, and sending calendar invites for follow-up meetings.
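To make the image-reasoning use case above concrete, here is a hedged sketch using the 11B vision variant with Hugging Face transformers. The checkpoint id, the chart image, and the question are assumptions; the weights are gated on Hugging Face, and a recent transformers release with Mllama support plus a capable GPU are required.

```python
# A hedged sketch of chart question answering with the 11B vision variant.
# Requires a recent transformers release with Mllama support and access to
# the gated checkpoint; the checkpoint id and input files are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed, gated checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("sales_chart.png")  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Which month shows the highest sales in this chart?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```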
Performance
Llama 3.2's lightweight 1B and 3B models outperform comparable open models such as Gemma 2 2.6B and Phi 3.5-mini on tasks like instruction-following and summarization, while the vision models are competitive with leading multimodal models on image recognition and visual understanding benchmarks.
Limitations
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Medical images: Llama 3.2 is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
Bring your own model
Register your own custom models in Labelbox to pre-label or enrich datasets.
OpenAI o1-mini
The o1 series of large language models is designed to perform advanced reasoning through reinforcement learning. These models engage in deep internal thought processes before delivering responses, enabling them to handle complex queries. o1-mini is a small specialized model optimized for STEM-related reasoning during its pretraining. Despite its reduced size, o1-mini undergoes the same high-compute reinforcement learning pipeline as the larger o1 models, achieving comparable performance on many reasoning tasks while being significantly more cost-efficient.
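A hedged sketch of querying o1-mini through OpenAI's Python SDK. The prompt is illustrative, and the call is kept minimal because o1-series models handle their reasoning internally and, at the time of writing, restrict parameters such as temperature and system messages.

```python
# A minimal sketch of an o1-mini request with the `openai` SDK.
# The call avoids extra parameters that o1-series models may reject.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-mini",
    messages=[{
        "role": "user",
        "content": "A train travels 120 km in 1.5 hours. What is its average speed in m/s?",
    }],
)
print(response.choices[0].message.content)
```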
Performance
Human raters compared o1-mini to GPT-4o on challenging, open-ended prompts across various domains to assess performance and accuracy in different types of tasks. These evaluations show that o1-mini is optimized for STEM-related tasks.
Limitations
Optimization for STEM Knowledge: o1-mini is not optimized for tasks requiring non-STEM factual knowledge, which may result in less accurate responses when handling queries outside of technical or scientific domains.
Domain Preference: o1-mini is preferred to GPT-4o in reasoning-heavy domains, but is not preferred to GPT-4o in language-focused domains, where linguistic nuance and fluency are more critical.
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
OpenAI O1 Preview
The o1 series of large language models is designed to perform advanced reasoning through reinforcement learning. These models engage in deep internal thought processes before delivering responses, enabling them to handle complex queries. o1-preview is designed to reason about hard problems using broad general knowledge about the world.
Performance
Human raters compared o1-preview to GPT-4o on challenging, open-ended prompts across various domains to assess performance and accuracy in different types of tasks. These evaluations show that o1-preview is optimized for STEM-related tasks.
Limitations
Optimization for STEM Knowledge: o1-preview is not optimized for tasks requiring non-STEM factual knowledge, which may result in less accurate responses when handling queries outside of technical or scientific domains.
Domain Preference: o1-preview is preferred to GPT-4o in reasoning-heavy domains, but is not preferred to GPT-4o in language-focused domains, where linguistic nuance and fluency are more critical.
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://openai.com/index/introducing-openai-o1-preview/
OpenAI GPT-4o
GPT-4o (“o” for “omni”) is the most advanced OpenAI model. It is multimodal (accepting text or image inputs and outputting text), and it has the same high intelligence as GPT-4 Turbo but is much more efficient—it generates text 2x faster and is 50% cheaper. Additionally, GPT-4o has the best vision and performance across non-English languages of any OpenAI model.
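A minimal sketch of a multimodal GPT-4o request combining text and an image URL via OpenAI's Python SDK; the image URL is a placeholder.

```python
# A minimal sketch of a text + image request to GPT-4o with the `openai` SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the main objects in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder
        ],
    }],
)
print(response.choices[0].message.content)
```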
Performance
As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.
Limitations
Accuracy: While GPT-4o can provide detailed and accurate responses, it may occasionally generate incorrect or nonsensical answers, particularly for highly specialized or obscure queries.
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://platform.openai.com/docs/models/gpt-4o
Google Gemini 1.5 Flash
Google introduced Gemini 1.5 Flash as a faster and cheaper model than Gemini 1.5 Pro. With a context window of up to 1 million tokens, it can process vast amounts of information, including:
1 hour of video
11 hours of audio
over 700,000 words
This substantial capacity makes Gemini 1.5 Flash particularly well-suited for real-time and context-intensive applications, enabling seamless processing of large volumes of information.
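A hedged sketch of a long-context request with the google-generativeai package. The API key variable and contract.txt file are placeholders, and the "gemini-1.5-flash" model name should be verified against Google's current model list.

```python
# A sketch of a long-context request with the `google-generativeai` package.
# The model name and input file are assumptions/placeholders.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Long-context use: pass a large document plus an instruction in a single request.
response = model.generate_content(
    ["Summarize the key obligations in this contract:", open("contract.txt").read()]
)
print(response.text)
```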
Intended Use
Rapid text generation and completion
Real-time conversation and chatbot applications
Quick information retrieval and summarization
Multimodal understanding (text, images, audio, video)
Code generation and analysis
Task planning and execution
Limitations
May sacrifice some accuracy for speed compared to larger models
Performance on highly specialized or technical tasks may vary
Could exhibit biases present in its training data
Limited by the knowledge cutoff of its training data
Cannot access real-time information or browse the internet
May struggle with tasks requiring deep logical reasoning or complex mathematical computations
Citation
https://deepmind.google/technologies/gemini/flash/
Bring your own model
Register your own custom models in Labelbox to pre-label or enrich datasets.
BLIP2 (blip2-flan-t5-xxl)
BLIP2 is a visual language model (VLM) that can perform multi-modal tasks such as image captioning and visual question answering. This model is the BLIP-2, Flan-T5-XXL variant.
Intended Use
BLIP2 can be used for multi-modal tasks such as image captioning and visual question answering. It can also hold chat-like conversations by feeding the image and the previous conversation as a prompt to the model.
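A sketch of visual question answering with the blip2-flan-t5-xxl checkpoint via Hugging Face transformers. The photo and question are placeholders, and the model is large enough that float16 weights and a sizeable GPU are assumed.

```python
# A sketch of visual question answering with blip2-flan-t5-xxl.
# The checkpoint is very large; float16 weights and a GPU are assumed.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # placeholder image
prompt = "Question: what is the person in the photo doing? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```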
Performance
Best performance within the BLIP2 family of models.
Limitations
BLIP2 is fine-tuned on image-text datasets (e.g., LAION) collected from the internet. As a result, the model may generate inappropriate content or replicate biases present in the underlying data. Other limitations include:
May struggle with highly abstract or culturally specific visual concepts
Performance can vary based on image quality and complexity
Limited by the training data of its component models (vision encoder and language model)
Cannot generate or edit images (only processes and describes them)
Requires careful prompt engineering for optimal performance in some tasks
Citation
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
Amazon Textract
Use this optical character recognition (OCR) model to extract text from images. The model takes images as input and generates text annotations placed within bounding boxes, with the bounding boxes grouped at the word level.
Intended Use
Amazon Textract extracts text, handwriting, and structured data from scanned documents, including forms and tables, surpassing basic OCR capabilities. It provides extracted data with bounding box coordinates and confidence scores to help users accurately assess and utilize the information.
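A hedged boto3 sketch of the AnalyzeDocument workflow, requesting forms and tables in addition to raw text; the AWS region and the invoice.png file are assumptions.

```python
# A hedged boto3 sketch of Textract's AnalyzeDocument API.
# The region and input file are assumptions/placeholders.
import boto3

textract = boto3.client("textract", region_name="us-east-1")  # region is an assumption

with open("invoice.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS", "TABLES"],
    )

# Each block carries text, a confidence score, and bounding-box geometry.
for block in response["Blocks"]:
    if block["BlockType"] == "WORD":
        print(block["Text"], round(block["Confidence"], 1), block["Geometry"]["BoundingBox"])
```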
Performance
Custom Queries: Amazon Textract allows customization of its pretrained Queries feature to enhance accuracy for specific document types while retaining data control. Users can upload and annotate a minimum of ten sample documents through the AWS Console to tailor the Queries feature within hours.
Layout: Amazon Textract extracts various layout elements from documents, including paragraphs, titles, and headers, via the Analyze Document API. This feature can be used independently or in conjunction with other document analysis features.
Optical Character Recognition (OCR): Textract’s OCR detects both printed and handwritten text from documents and images, handling various fonts, styles, and text distortions through machine learning. It is capable of recognizing text in noisy or distorted conditions.
Form Extraction: Textract identifies and retains key-value pairs from documents automatically, preserving their context for easier database integration. Unlike traditional OCR, it maintains the relationship between keys and values without needing custom rules.
Table Extraction: The service extracts and maintains the structure of tabular data in documents, such as financial reports or medical records, allowing for easy import into databases. Data in rows and columns, like inventory reports, is preserved for accurate application.
Signature Detection: Textract detects signatures on various documents and images, including checks and loan forms, and provides the location and confidence scores of these signatures in the API response.
Query-Based Extraction: Textract enables data extraction using natural language queries, eliminating the need to understand document structure or format variations. It’s pre-trained on a diverse set of documents, reducing post-processing and manual review needs.
Analyze Lending: The Analyze Lending API automates the extraction and classification of information from mortgage loan documents. It uses preconfigured machine learning models to organize and process loan packages upon upload.
Invoices and Receipts: Textract leverages machine learning to extract key data from invoices and receipts, such as vendor names, item prices, and payment terms, despite varied layouts. This reduces the complexity of manual data extraction.
Identity Documents: Textract uses ML to extract and understand details from identity documents like passports and driver’s licenses, including implied information. This facilitates automated processes in ID verification, account creation, and more without template reliance.
Limitations
May struggle with highly stylized fonts or severe document degradation
Handwriting recognition accuracy can vary based on writing style
Performance may decrease with complex, multi-column layouts
Limited ability to understand document context or interpret extracted data
May have difficulty with non-Latin scripts or specialized notation
Citation
https://docs.aws.amazon.com/textract/
Claude 3 Haiku
Claude 3 Haiku stands out as the fastest and most cost-effective model in its class. Boasting cutting-edge vision capabilities and exceptional performance on industry benchmarks, Haiku offers a versatile solution for a broad spectrum of enterprise applications.
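As a concrete example of the vision capability, here is a minimal sketch that sends a document image to Claude 3 Haiku with Anthropic's Python SDK; the model id and the document.png file are assumptions.

```python
# A minimal sketch of sending an image to Claude 3 Haiku with the `anthropic` SDK.
# The model id "claude-3-haiku-20240307" and the input file are assumptions.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("document.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "List the key fields and values visible in this document."},
        ],
    }],
)
print(message.content[0].text)
```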
Intended Use
Performance
Near-instant results: The Claude 3 models excel in powering real-time tasks such as live customer chats, auto-completions, and data extraction. Haiku, the fastest and most cost-effective model, and Sonnet, which is twice as fast as Claude 2 and 2.1, both offer superior intelligence and performance for a variety of demanding applications.
Vision and Image Processing: This model can process and analyze visual input, extracting insights from documents, processing web UI, generating image catalog metadata, and more.
Long context and near-perfect recall: Haiku offers a 200K context window and can process inputs exceeding 1 million tokens, with Claude 3 Opus achieving near-perfect recall, surpassing 99% accuracy, in the "Needle In A Haystack" evaluation.
Limitations
Medical images: Claude 3 Haiku is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
Non-English: Claude 3 Haiku may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.
Big text: Users should enlarge text within the image to improve readability for Claude 3 Haiku, but avoid cropping important details.
Rotation: Claude 3 Haiku may misinterpret rotated / upside-down text or images.
Visual elements: Claude 3 Haiku may struggle to understand graphs or text where colors or styles like solid, dashed, or dotted lines vary.
Spatial reasoning: Claude 3 Haiku struggles with tasks requiring precise spatial localization, such as identifying chess positions.
Hallucinations: The model can provide factually inaccurate information.
Citation
https://docs.anthropic.com/claude/docs/models-overview
Amazon Rekognition
Common object detection and image classification model by Amazon Rekognition.
Intended Use
Amazon Rekognition's object detection model is primarily used for detecting objects, scenes, activities, landmarks, faces, dominant colors, and image quality in images and videos. Some common use cases include:
Detect and label common objects in images
Identify activities and scenes in visual content
Enable content moderation and filtering
Enhance image search capabilities
Performance
Amazon Rekognition's object detection model has been reported to achieve high accuracy in detecting objects and scenes in images and videos. Its capabilities include the following (a usage sketch follows this list):
Can detect thousands of object categories
Provides bounding boxes for object locations
Assigns confidence scores to detections
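A hedged boto3 sketch of label detection that surfaces the labels, confidence scores, and bounding boxes described above; the AWS region and image file are assumptions.

```python
# A hedged boto3 sketch of label detection with Amazon Rekognition.
# The region and input file are assumptions/placeholders.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")  # region is an assumption

with open("street_scene.jpg", "rb") as f:
    response = rekognition.detect_labels(
        Image={"Bytes": f.read()},
        MaxLabels=10,
        MinConfidence=80,
    )

for label in response["Labels"]:
    # Instances carry bounding boxes when the label refers to a localizable object.
    boxes = [inst["BoundingBox"] for inst in label.get("Instances", [])]
    print(label["Name"], round(label["Confidence"], 1), boxes)
```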
Limitations
The performance of the model may be limited by factors such as the quality and quantity of training data, the complexity of the image content, or the accuracy of the annotations. Additionally, Amazon Rekognition may have detection issues with black-and-white images and with images of elderly people.
Other limitations include:
May struggle with small or partially obscured objects
Performance can vary based on image quality and lighting
Limited ability to understand context or relationships between objects
Cannot identify specific individuals (separate face recognition API for that)
May have biases in detection rates across different demographics
Citation
Llama 3.1 405B
Llama 3.1 builds upon the success of its predecessors, offering enhanced performance, improved safety measures, and greater flexibility for researchers and developers. It demonstrates exceptional proficiency in language understanding, generation, and reasoning tasks, making it a powerful tool for a wide range of applications. It is one of the most powerful open source AI models, which you can fine-tune, distill and deploy anywhere. The latest instruction-tuned model is available in 8B, 70B and 405B versions.
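For local experimentation, here is a hedged sketch of the smallest instruction-tuned variant with Hugging Face transformers. The checkpoint id is an assumption, the weights are gated on Hugging Face, and the 405B variant normally requires a hosted endpoint or a multi-GPU cluster instead.

```python
# A hedged sketch of local chat with the 8B instruction-tuned Llama 3.1 model.
# Requires a recent transformers release and access to the gated checkpoint;
# the checkpoint id is an assumption.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint id
    device_map="auto",                          # place weights on GPU if available
)

messages = [{"role": "user", "content": "Explain the difference between precision and recall in two sentences."}]
result = chat(messages, max_new_tokens=120)
print(result[0]["generated_text"][-1]["content"])  # assistant reply
```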
Intended Use
Research and Development: Ideal for exploring cutting-edge AI research, developing new model architectures, and fine-tuning for specific tasks.
Open-Source Community: Designed to foster collaboration and accelerate innovation in the open-source AI community.
Education and Experimentation: A valuable resource for students and researchers to learn about and experiment with state-of-the-art LLM technology.
Performance
Enhanced Performance: Llama 3.1 boasts improvements in various benchmarks, including language modeling, question answering, and text summarization.
Improved Safety: The model has undergone rigorous safety training to reduce the risk of generating harmful or biased outputs.
Increased Flexibility: Llama 3.1 is available in multiple sizes, allowing users to choose the model that best suits their compute resources and specific needs.
Limitations
Data Freshness: The pretraining data has a cutoff of December 2023.
Citations
Claude 3.5 Sonnet
Claude 3.5 Sonnet sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval). It shows marked improvement in grasping nuance, humor, and complex instructions, and is exceptional at writing high-quality content with a natural, relatable tone. Claude 3.5 Sonnet operates at twice the speed of Claude 3 Opus. This performance boost, combined with cost-effective pricing, makes Claude 3.5 Sonnet ideal for complex tasks such as context-sensitive customer support and orchestrating multi-step workflows.
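The task-automation and multi-step workflow scenarios typically rely on Anthropic's tool-use API. Below is a minimal, hedged sketch in which Claude 3.5 Sonnet decides whether to call a hypothetical query_database tool; the model id, tool name, and schema are assumptions.

```python
# A hedged sketch of tool use with Claude 3.5 Sonnet via the `anthropic` SDK.
# The model id and the query_database tool are hypothetical placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "query_database",                        # hypothetical tool
    "description": "Run a read-only SQL query against the orders database",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}]

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",               # assumed model id
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "How many orders were placed last week?"}],
)

# If Claude decided to call the tool, the response contains a tool_use block
# whose input you would execute and return in a follow-up message.
for block in message.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```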
Intended Use
Task automation: plan and execute complex actions across APIs and databases, interactive coding
R&D: research review, brainstorming and hypothesis generation, drug discovery
Strategy: advanced analysis of charts & graphs, financials and market trends, forecasting
Performance
Advanced coding ability: In an internal evaluation by Anthropic, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus, which solved 38%.
Multilingual Capabilities: Claude 3.5 Sonnet offers improved fluency in non-English languages such as Spanish and Japanese, enabling use cases like translation services and global content creation.
Vision and Image Processing: This model can process and analyze visual input, extracting insights from documents, processing web UI, generating image catalog metadata, and more.
Steerability and Ease of Use: Claude 3.5 Sonnet is designed to be easy to steer and better at following directions, giving you more control over model behavior and more predictable, higher-quality outputs.
Limitations
Here are some of the limitations we are aware of:
Medical images: Claude 3.5 is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
Non-English: Claude 3.5 may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.
Big text: Users should enlarge text within the image to improve readability for Claude 3.5, but avoid cropping important details.
Rotation: Claude 3.5 may misinterpret rotated / upside-down text or images.
Visual elements: Claude 3.5 may struggle to understand graphs or text where colors or styles like solid, dashed, or dotted lines vary.
Spatial reasoning: Claude 3.5 struggles with tasks requiring precise spatial localization, such as identifying chess positions.
Hallucinations: The model can provide factually inaccurate information.
Image shape: Claude 3.5 struggles with panoramic and fisheye images.
Metadata and resizing: Claude 3.5 doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions.
Counting: Claude 3.5 may give approximate counts for objects in images.
CAPTCHAs: For safety reasons, Claude 3.5 has a system to block the submission of CAPTCHAs.
Citation
Let foundation models do the work
We are bringing together the world's best AI models to help you perform data labeling and enrichment tasks across all supported data modalities. Achieve breakthroughs in data generation speed and costs. Re-focus human efforts on quality assurance.
"We love our initial use of Model Foundry. Instead of going into unstructured text datasets blindly, we can now use pre-existing LLMs to pre-label data or pre-tag parts of it. Model Foundry serves as a co-pilot for training data."
- Alexander Booth, Assistant Director of the World Series Champions, Texas Rangers
Pre-label data in a few clicks
AI builders can now enrich datasets and pre-label data in minutes without code using foundation models offered by leading providers or open-source alternatives. Model-assisted labeling using Foundry accelerates data labeling tasks on images, text, and documents at a fraction of the typical cost and time.
Focus human expertise on where it matters most
The days of manually labeling from scratch are long gone. Foundation models can outperform crowd-sourced labeling on various tasks with high accuracy – allowing you to focus valuable human efforts on critical review. Combine model-assisted labeling using foundation models with human-in-the-loop review to accelerate your labeling operations and build intelligent AI faster than ever.
Automate data tasks with Foundry Apps
Create custom Apps with Foundry based on your team’s needs. Deploy a model configuration from Foundry to an App for automated data management or to build custom intelligent applications.
Explore datasets using AI insights
Curate and explore datasets faster than ever. Query your data using predictions generated from the latest and greatest foundation models. Supercharge your data enrichment process and accelerate curation efforts across image and text data modalities.
Accelerate custom model development
Data enriched by Foundry can be used to tailor foundation models to your specific use cases:
Fine-tune leading foundation models: Optimize foundation models for your most valuable AI use cases. Enhance performance for specialized tasks in just days instead of months.
Distill knowledge into smaller models: Distill large foundation models into lightweight models purpose-built for your applications.
Frequently asked questions
Who gets access to the Model Foundry?
To use Labelbox Foundry, you will need to be on our Starter or Enterprise plan. You will be billed monthly for the pass-through compute costs associated with foundation models run on your data in Labelbox.
Learn more about how to upgrade to a pay-as-you-go Starter plan or about our Labelbox Unit pricing.
Contact us to receive more detailed instructions on how to create a Labelbox Starter or Enterprise account or upgrade to our Starter tier.
How is Foundry priced?
Only pay for the models that you want to pre-label or enrich your data with. Foundry pricing will be calculated and billed monthly based on the following:
Inference cost – Labelbox will charge customers for inference costs for all models hosted by Labelbox. Inference costs will be bespoke to each model available in Foundry. The inference price is determined by vendor pricing or our compute costs – these are published publicly on our website as well as inside the product.
Labelbox's platform cost – each asset with predictions generated by Foundry will accrue LBUs.
Learn more about the pricing of the Foundry add-on for Labelbox Model on our pricing page.
What kind of foundation models are available?
Labelbox Foundry currently supports a variety of tasks for computer vision and natural language processing. This includes powerful open-source and third-party models across text generation, object detection, translation, text classification, image segmentation, and more. For a full list of available models, please visit this page.
How do I get started with Foundry?
If you aren’t currently a Labelbox user or are on our Free plan, you’ll need to:
Create a Labelbox account
Upgrade your account to our Starter plan:
In the Billing tab, locate “Starter” in the All Plans list and select “Switch to Plan.” The credit card on file will only be charged when you exceed your existing free 10,000 LBU.
Upgrades take effect immediately so you'll have access to Foundry right away on the Starter plan. After upgrading, you’ll see the option to activate Foundry for your organization.
Follow the steps below to get started generating model predictions with Foundry.
If you are currently a Labelbox customer on our Starter or Enterprise plans, you automatically have the ability to opt-in to using Foundry:
After selecting data in Catalog and hitting “Predict with Foundry,” you’ll be able to select a foundation model and submit a model run to generate predictions. You can learn more about the Foundry workflow and see it in-action in this guide.
When submitting a model run, you’ll see the option to activate Foundry for your organization’s Admins. You’ll need to agree to Labelbox’s Model Foundry add-on service terms and confirm you understand the associated compute fees.
Does Labelbox sell or use customer data via Foundry?
Labelbox does not sell or use customer data in a way other than to provide you the services.
Do the third-party model providers use customer data to train their models?
Labelbox's third-party model providers will not use your data to train their models, as governed by specific 'opt-out' terms between Labelbox and our third-party model providers.
Can I use the closed source models available through Foundry for pre-labeling?
OpenAI: Labelbox believes that you have the right to use OpenAI’s closed-source models (e.g., GPT-4, GPT-3.5, etc.) for pre-labeling through Foundry if you comply with applicable laws and otherwise follow OpenAI’s usage policies. If you wish to use your own instance of OpenAI, you can do so by setting up a custom model on Foundry.
Anthropic: Labelbox believes that you have the right to use Anthropic’s Claude for pre-labeling through Foundry if you comply with applicable laws and otherwise follow Anthropic’s Acceptable Use Policy. If you wish to use your own instance of Anthropic, you can do so by setting up a custom model on Foundry.
Google: Labelbox believes that you have the right to use Google’s Gemini (or other closed-source models such as Imagen, PaLM, etc.) for pre-labeling through Foundry if you comply with applicable laws and otherwise follow the Gemini API Additional Terms of Service, including without limitation the Generative AI Prohibited Use Policy. If you wish to use your own instance of Google, you can do so by setting up a custom model on Foundry.