Models
Google Gemini 2.0 Pro
Gemini 2.0 Pro is the strongest model in the Gemini family for coding and world knowledge, and it features a 2-million-token context window. Gemini 2.0 Pro is available as an experimental model in Vertex AI and is an upgrade path for 1.5 Pro users who want better quality, or who are particularly invested in long context and code.
Intended Use
Multimodal input
Text output
Prompt optimizers
Controlled generation
Function calling (excluding compositional function calling)
Grounding with Google Search
Code execution
Token counting
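Most of these features are exposed through the Gen AI SDK on Vertex AI. Below is a minimal sketch of controlled generation and token counting, assuming the google-genai Python package; the project, location, and experimental model ID are placeholders and should be replaced with current values.

```python
from google import genai
from google.genai import types

# Vertex AI client; project and location are placeholders.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

MODEL_ID = "gemini-2.0-pro-exp-02-05"  # experimental model ID; check the current name

# Controlled generation: constrain the output to a JSON schema.
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Extract the product name and price from: 'The Acme X200 costs $499.'",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "product": {"type": "string"},
                "price": {"type": "number"},
            },
            "required": ["product", "price"],
        },
    ),
)
print(response.text)

# Token counting for the same prompt.
tokens = client.models.count_tokens(model=MODEL_ID, contents="The Acme X200 costs $499.")
print(tokens.total_tokens)
```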
Performance
Google Gemini 2.0 Pro has the strongest coding performance and ability to handle complex prompts of any Gemini model released so far, with better understanding of and reasoning about world knowledge.
It features the largest context window of 2 million tokens, enabling comprehensive analysis and understanding of large amounts of information. Additionally, it can call external tools like Google Search and execute code, enhancing its utility for a wide range of tasks, including coding and knowledge analysis.

Limitations
Context: Gemini 2.0 Pro may struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As Gemini 2.0 Pro is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, Gemini 2.0 Pro may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Gemini 2.0 Pro can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Gemini 2.0 Pro might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of Gemini 2.0 Pro's output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2#2.0-pro
Google Gemini 2.0 Flash
Gemini 2.0 Flash is designed to handle high-volume, high-frequency tasks at scale and is highly capable of multimodal reasoning across vast amounts of information with a context window of 1 million tokens.
Intended Use
Text generation
Grounding with Google Search
Gen AI SDK
Multimodal Live API
Bounding box detection
Image generation
Speech generation
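A minimal sketch of Grounding with Google Search through the Gen AI SDK on Vertex AI, assuming the google-genai Python package; the project, location, model ID, and prompt below are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Ground the response in Google Search results.
response = client.models.generate_content(
    model="gemini-2.0-flash-001",  # check the current model ID
    contents="Who won the most recent FIFA World Cup?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)

# Grounding metadata (web sources) is attached to the candidate.
print(response.candidates[0].grounding_metadata)
```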
Performance
Gemini 2.0 Flash outperforms its predecessor, Gemini 1.5 Pro, on key benchmarks, at twice the speed. It also features the following improvements:
Multimodal Live API: This new API enables low-latency bidirectional voice and video interactions with Gemini.
Quality: Enhanced performance across most quality benchmarks compared to Gemini 1.5 Pro.
Improved agentic capabilities: 2.0 Flash delivers improvements to multimodal understanding, coding, complex instruction following, and function calling. These improvements work together to support better agentic experiences.

Limitations
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2#2.0-flash
Claude 3.7 Sonnet
Claude 3.7 Sonnet, by Anthropic, can produce near-instant responses or extended, step-by-step thinking. Claude 3.7 Sonnet shows particularly strong improvements in coding and front-end web development.
Intended Use
Claude 3.7 Sonnet is designed to enhance real-world tasks by offering a blend of fast responses and deep reasoning, particularly in coding, web development, problem-solving, and instruction-following.
Optimized for real-world applications rather than competitive math or computer science problems.
Useful in business environments requiring a balance of speed and accuracy.
Ideal for tasks like bug fixing, feature development, and large-scale refactoring.
Coding Capabilities:
Strong in handling complex codebases, planning code changes, and full-stack updates.
Introduces Claude Code, an agentic coding tool that can edit files, write and run tests, and commit and push code to repositories such as GitHub.
Claude Code significantly reduces development time by automating tasks that would typically take 45+ minutes manually.
Performance
Claude 3.7 Sonnet combines the capabilities of a large language model (LLM) with advanced reasoning, allowing users to choose between standard mode for quick responses and extended thinking mode for deeper reflection before answering. In extended thinking mode, Claude self-reflects, improving performance in tasks like math, physics, coding, and instruction following. Users can also control the thinking time via the API, adjusting the token budget to balance speed and answer quality.
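A minimal sketch of this thinking-budget control using the Anthropic Messages API with the anthropic Python package; the model ID, token budget, and prompt below are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # check the current model ID
    max_tokens=4096,
    # Extended thinking: allocate a token budget for internal reasoning
    # before the final answer. Larger budgets trade latency for quality.
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[
        {"role": "user", "content": "A train travels 180 km in 2.5 hours. What is its average speed?"}
    ],
)

# The response interleaves thinking blocks and text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```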

Early testing demonstrated Claude’s superiority in coding, with significant improvements in handling complex codebases, advanced tool usage, and planning code changes. It also excels at full-stack updates and producing production-ready code with high precision, as seen in use cases with platforms like Vercel, Replit, and Canva. Claude's performance is particularly strong in developing sophisticated web apps, dashboards, and reducing errors. This makes it a top choice for developers working on real-world coding tasks.
Limitations
Context: Claude 3.7 Sonnet may struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As Claude 3.7 Sonnet is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, Claude 3.7 Sonnet may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Claude 3.7 Sonnet can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Claude 3.7 Sonnet might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output of Claude 3.7 Sonnet are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://www.anthropic.com/news/claude-3-7-sonnet
Claude 3.5 Haiku
Claude 3.5 Haiku, the next generation of Anthropic's fastest and most cost-effective model, is optimal for use cases where speed and affordability matter. It improves on its predecessor across every skill set.
Intended Use
Claude 3.5 Haiku offers fast speeds, improved instruction-following, and accurate tool use, making it ideal for user-facing products and personalized experiences.
Key use cases include code completion, where it streamlines development workflows with quick, accurate code suggestions. It powers interactive chatbots for customer service, e-commerce, and education, handling high volumes of user interactions. It excels at data extraction and labeling, processing large datasets in sectors like finance and healthcare. Additionally, it provides real-time content moderation for safe online environments.
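A minimal data extraction sketch using the Anthropic Messages API with the anthropic Python package; the model ID, system prompt, and input text below are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Structured data extraction from free text, a typical Haiku workload.
note = "Patient reports mild headache since Tuesday; prescribed ibuprofen 400 mg twice daily."

response = client.messages.create(
    model="claude-3-5-haiku-20241022",  # check the current model ID
    max_tokens=300,
    system="Extract symptom, onset, medication, and dosage as a JSON object. Reply with JSON only.",
    messages=[{"role": "user", "content": note}],
)
print(response.content[0].text)
```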
Performance

Limitations
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://docs.anthropic.com/claude/docs/models-overview
BLIP2 (blip2-flan-t5-xxl)
BLIP2 is a visual language model (VLM) that can perform multi-modal tasks such as image captioning and visual question answering. This model is the BLIP-2, Flan-T5-XXL variant.
Intended Use
In addition to image captioning and visual question answering, BLIP2 can be used for chat-like conversations by feeding the image and the previous conversation as a prompt to the model.
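A minimal captioning and visual question answering sketch using the Hugging Face transformers library, assuming the Salesforce/blip2-flan-t5-xxl checkpoint, a GPU with sufficient memory, and the accelerate package for device_map="auto"; the image path and question are placeholders.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# blip2-flan-t5-xxl is large; half precision and automatic device placement are assumed.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")

# Image captioning: no text prompt, the model describes the image.
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0])

# Visual question answering: pass a question alongside the image.
inputs = processor(
    images=image,
    text="Question: how many people are in the photo? Answer:",
    return_tensors="pt",
).to(model.device, torch.float16)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```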
Performance
Best performance within the BLIP2 family of models.
Limitations
BLIP2 is fine-tuned on image-text datasets (e.g., LAION) collected from the internet. As a result, the model is potentially vulnerable to generating inappropriate content or replicating inherent biases present in the underlying data. Other limitations include:
May struggle with highly abstract or culturally specific visual concepts
Performance can vary based on image quality and complexity
Limited by the training data of its component models (vision encoder and language model)
Cannot generate or edit images (only processes and describes them)
Requires careful prompt engineering for optimal performance in some tasks
Citation
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
Grounding DINO + SAM
Grounding DINO + SAM, or Grounding SAM, uses Grounding DINO as an open-set object detector combined with the Segment Anything Model (SAM). This integration enables the detection and segmentation of arbitrary regions based on free-text inputs and opens a door to connecting various vision models by enabling users to create segmentation masks quickly.
Intended Use
Create segmentation masks using SAM and classify the masks using Grounding DINO. The masks are intended to be used as pre-labels.
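A minimal sketch of this two-stage pipeline using the Hugging Face transformers ports of Grounding DINO and SAM; the checkpoints, thresholds, class names, and image path below are illustrative rather than the exact configuration used on the platform.

```python
import torch
from PIL import Image
from transformers import (AutoProcessor, AutoModelForZeroShotObjectDetection,
                          SamModel, SamProcessor)

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("aerial_tile.jpg").convert("RGB")

# 1) Grounding DINO: detect boxes for the text classes of interest.
dino_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
dino = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base").to(device)
text = "a roof. a solar panel."  # classes are lowercased and period-separated
inputs = dino_proc(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = dino(**inputs)
detections = dino_proc.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]])[0]
if len(detections["boxes"]) == 0:
    raise SystemExit("No boxes detected for the given text prompt.")

# 2) SAM: turn each detected box into a segmentation mask (the pre-label).
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-huge")
sam = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
boxes = [[box.tolist() for box in detections["boxes"]]]
sam_inputs = sam_proc(image, input_boxes=boxes, return_tensors="pt").to(device)
with torch.no_grad():
    sam_outputs = sam(**sam_inputs)
masks = sam_proc.image_processor.post_process_masks(
    sam_outputs.pred_masks.cpu(), sam_inputs["original_sizes"].cpu(),
    sam_inputs["reshaped_input_sizes"].cpu())

# Each mask is paired with the text label and detection score from Grounding DINO.
for label, score in zip(detections["labels"], detections["scores"]):
    print(label, round(score.item(), 3))
```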
Limitations
Inaccurate classification might occur, especially for aerial imagery classes such as roofs and solar panels.
The accuracy of masks is suboptimal in areas with complex shapes, low-contrast zones, and small objects.
Citation
Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others. (2023). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499
Chen, Jiaqi and Yang, Zeyu and Zhang, Li. (2023). Semantic Segment Anything. https://github.com/fudan-zvg/Semantic-Segment-Anything
Google Gemini 1.5 Pro
Google Gemini 1.5 Pro is a large-scale language model trained jointly across image, audio, video, and text data, with the goal of building a model with strong generalist capabilities across modalities alongside cutting-edge understanding and reasoning performance. This model uses Gemini 1.5 Pro on Vertex AI, with enhanced performance, scalability, and deployability.
Intended Use
Gemini 1.5, the next-generation model from Google, offers significant performance enhancements. Built on a Mixture-of-Experts (MoE) architecture, it delivers improved efficiency in training and serving. Gemini 1.5 Pro, the mid-size multimodal model, exhibits comparable quality to the larger 1.0 Ultra, featuring a breakthrough experimental long-context understanding feature. With a context window of up to 1 million tokens, it can process vast amounts of information, including:
1 hour of video
11 hours of audio
over 700,000 words
Use cases
Gemini is good at a wide variety of multimodal use cases, including but not limited to:
Info Seeking: Fusing world knowledge with information extracted from the images and videos.
Object Recognition: Answering questions related to fine-grained identification of the objects in images and videos.
Digital Content Understanding: Answering questions and extracting information from various contents like infographics, charts, figures, tables, and web pages.
Structured Content Generation: Generating responses in formats like HTML and JSON, based on provided prompt instructions.
Captioning / Description: Generating descriptions of images and videos with varying levels of detail. For example, for images, the prompt can be: “Can you write a description about the image?”. For videos, the prompt can be: “Can you write a description about what's happening in this video?” (A minimal captioning sketch follows this list.)
Extrapolations: Suggesting what else to see based on location, what might happen next/before/between images or videos, and enabling creative uses like writing stories based on visual inputs.
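A minimal captioning and structured-generation sketch using the Vertex AI Python SDK; the project, location, Cloud Storage URI, and model ID below are placeholders.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-002")  # check the current model ID

# Captioning / description of a video stored in Cloud Storage.
video = Part.from_uri("gs://my-bucket/clip.mp4", mime_type="video/mp4")
response = model.generate_content(
    [video, "Can you write a description about what's happening in this video?"]
)
print(response.text)

# Structured content generation: ask for JSON and parse it downstream.
response = model.generate_content(
    [video, "List the objects visible in the video as a JSON array of strings."],
    generation_config={"response_mime_type": "application/json"},
)
print(response.text)
```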
Limitations
Here are some of the limitations we are aware of:
Medical images: Gemini Pro is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
Hallucinations: the model can provide factually inaccurate information.
Counting: Gemini Pro may give only approximate counts for objects in images, especially when objects are obscured.
CAPTCHAS: For safety reasons, Gemini Pro has a system to block the submission of CAPTCHAs.
Multi-turn (multimodal) chat: Not trained for chatbot functionality or answering questions in a chatty tone, and can perform less effectively in multi-turn conversations.
Following complex instructions: Can struggle with tasks requiring multiple reasoning steps. Consider breaking down instructions or providing few-shot examples for better guidance.
Spatial reasoning: Can struggle with precise object/text localization in images. It may be less accurate in understanding rotated images.
Citation
https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#sundar-note
Google Gemini Pro
Google Gemini is a large-scale language model trained jointly across image, audio, video, and text data, with the goal of building a model with strong generalist capabilities across modalities alongside cutting-edge understanding and reasoning performance. This model uses Gemini Pro on Vertex AI, with enhanced performance, scalability, and deployability.
Intended Use
Gemini is designed to process and reason across different inputs like text, images, video, and code. On the Labelbox platform, Gemini supports a wide range of image and language tasks such as text generation, question answering, classification, visual understanding, and answering questions about math.
Performance
Gemini is Google's largest and most capable model to date. It is the first AI model to surpass human experts on the Massive Multitask Language Understanding (MMLU) benchmark, and it reports state-of-the-art (SOTA) performance on multi-modal tasks.
Limitations
There is a continued need for research and development on reducing “hallucinations” generated by LLMs. LLMs also struggle with tasks requiring high-level reasoning abilities such as causal understanding, logical deduction, and counterfactual reasoning.
Citation
Technical report on Gemini: a Family of Highly Capable Multimodal Models
Tesseract OCR
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol, UK and at Hewlett-Packard Co., Greeley, Colorado, USA between 1985 and 1994, with further changes in 1996 to port it to Windows and a migration toward C++ in 1998. In 2005 Tesseract was open sourced by HP, and from 2006 until November 2018 it was developed by Google.
Intended Use
This model uses Tesseract (https://github.com/tesseract-ocr/tesseract) for OCR, and writes the output as a text annotation.
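A minimal sketch of the underlying engine via the pytesseract wrapper (illustrating Tesseract itself, not the platform integration); the image path is a placeholder and the tesseract binary must be installed separately.

```python
from PIL import Image
import pytesseract  # Python wrapper; requires a separately installed tesseract binary

# Plain text extraction from a scanned page.
text = pytesseract.image_to_string(Image.open("scanned_page.png"), lang="eng")
print(text)

# Word-level boxes and confidences, useful for building text annotations.
data = pytesseract.image_to_data(Image.open("scanned_page.png"),
                                 output_type=pytesseract.Output.DICT)
for word, conf, x, y, w, h in zip(data["text"], data["conf"], data["left"],
                                  data["top"], data["width"], data["height"]):
    if word.strip():
        print(word, conf, (x, y, w, h))
```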
Performance
Tesseract works best with straight, well-scanned text. For text in the wild, handwriting, and other challenging use cases, other models should be used.
Citation
https://github.com/tesseract-ocr/tesseract
Google Imagen
Google's Imagen generates a relevant description or caption for a given image.
Intended Use
The model performs image captioning, allowing users to generate a relevant description for an image. You can use this information for a variety of use cases:
Creators can generate captions for uploaded images
Generate captions to describe products
Integrate Imagen captioning with an app using the API to create new experiences (see the sketch below)
Imagen currently supports five languages: English, German, French, Spanish and Italian.
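A minimal captioning sketch using the Vertex AI Python SDK's ImageTextModel; the project, location, image path, and model version below are placeholders and may differ from the current API surface.

```python
import vertexai
from vertexai.vision_models import Image, ImageTextModel

vertexai.init(project="my-project", location="us-central1")
model = ImageTextModel.from_pretrained("imagetext@001")  # check the current model version

image = Image.load_from_file("product_photo.jpg")

# Generate up to three candidate captions in English.
captions = model.get_captions(image=image, number_of_results=3, language="en")
for caption in captions:
    print(caption)
```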
Performance
The Imagen model is reported to achieve high accuracy; however, it may have limitations in generating captions for complex or abstract images. The model may also generate captions that reflect biases present in the training data.
Citations
Google Image captioning documentation
Google Visual Question Answering documentation
Grounding DINO
An open-set object detector that combines the Transformer-based detector DINO with grounded pre-training. It can detect arbitrary objects given human inputs such as category names or referring expressions.
Intended Use
Useful for zero-shot object detection tasks.
Performance
Grounding DINO performs remarkably well across all three evaluation settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. It achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO, and sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP.
Citation
Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others. (2023). Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499
OpenAI GPT4
ChatGPT is an advanced conversational artificial intelligence language model developed by OpenAI. It is based on the GPT-4 architecture and has been trained on a diverse range of internet text to generate human-like responses in natural language conversations. This is the latest version of the model.
Intended Use
GPT stands for Generative Pre-trained Transformer, a type of language model that uses deep learning to generate human-like, conversational text. As a multimodal model, GPT-4 is able to accept both text and image inputs.
However, OpenAI has not yet made the GPT-4 model's visual input capabilities available through any platform. Currently the only way to access the text-input capability through OpenAI is with a subscription to ChatGPT Plus.
The GPT-4 model is optimized for conversational interfaces and can be used to generate text summaries, reports, and responses. Currently, only text modality is supported.
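A minimal text-only sketch using the OpenAI Python SDK; the prompt and sampling parameters below are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the trade-offs between REST and gRPC in three bullet points."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```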
Performance
GPT-4 is a highly advanced model that can accept both image and text inputs, making it more versatile than its predecessor, GPT-3.5. However, it is important to use the appropriate techniques to get the best results, as the model behaves differently than older GPT models.
OpenAI published results for the GPT-4 model comparing it to other state-of-the-art (SOTA) models, including its previous GPT-3.5 model.
| Benchmark | GPT-4 (evaluated few-shot) | GPT-3.5 (evaluated few-shot) | LM SOTA (best external LM, evaluated few-shot) | SOTA (best external model, incl. benchmark-specific training) |
|---|---|---|---|---|
| MMLU — multiple-choice questions in 57 subjects (professional & academic) | 86.4% (5-shot) | 70.0% (5-shot) | 70.7% (5-shot, U-PaLM) | 75.2% (5-shot, Flan-PaLM) |
| HellaSwag — commonsense reasoning around everyday events | 95.3% (10-shot) | 85.5% (10-shot) | 84.2% (LLaMA, validation set) | 85.6% (ALUM) |
| AI2 Reasoning Challenge (ARC) — grade-school multiple-choice science questions, challenge set | 96.3% (25-shot) | 85.2% (25-shot) | 85.2% (8-shot, PaLM) | 86.5% (ST-MOE) |
| WinoGrande — commonsense reasoning around pronoun resolution | 87.5% (5-shot) | 81.6% (5-shot) | 85.1% (5-shot, PaLM) | 85.1% (5-shot, PaLM) |
| HumanEval — Python coding tasks | 67.0% (0-shot) | 48.1% (0-shot) | 26.2% (0-shot, PaLM) | 65.8% (CodeT + GPT-3.5) |
| DROP (F1 score) — reading comprehension & arithmetic | 80.9 (3-shot) | 64.1 (3-shot) | 70.8 (1-shot, PaLM) | 88.4 (QDGAT) |
Limitations
The underlying format of the GPT-4 model is more likely to change over time, and it may provide less useful responses if interacted with in the same way as older models. The GPT-4 model has limitations similar to previous GPT models, such as being prone to LLM hallucination and reasoning errors. Nonetheless, OpenAI reports that GPT-4 hallucinates less often than previous models.
Citation
https://openai.com/research/gpt-4
CLIP ViT LAION (Classification)
Zero-shot image classification. The model produces exactly one prediction per classification task unless the maximum classification score is below the confidence threshold.
Intended Use
Zero-shot image classification using free-text class names; the model returns exactly one prediction per classification task unless the maximum classification score falls below the confidence threshold.
This is a CLIP ViT-bigG/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).
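A minimal zero-shot classification sketch with OpenCLIP that reproduces the thresholding behavior described above; the pretrained tag, class names, image path, and threshold value are illustrative and may differ from the platform configuration.

```python
import torch
import open_clip
from PIL import Image

# ViT-bigG-14 trained on LAION-2B; the pretrained tag below is an OpenCLIP
# release name and may differ from the exact weights used on the platform.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k")
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
model.eval()

class_names = ["cat", "dog", "bird"]
confidence_threshold = 0.5  # illustrative value

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer([f"a photo of a {c}" for c in class_names])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

# Exactly one prediction per task, unless the best score is below the threshold.
best = probs.argmax().item()
if probs[best] >= confidence_threshold:
    print(class_names[best], round(probs[best].item(), 3))
else:
    print("no prediction above threshold")
```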
OpenAI GPT-3.5 Turbo
ChatGPT is an advanced conversational artificial intelligence language model developed by OpenAI. It is based on the GPT-3.5 architecture and has been trained on a diverse range of internet text to generate human-like responses in natural language conversations. This is the latest version of the GPT-3.5 Turbo model.
Intended Use
The primary intended use of ChatGPT-3.5 is to provide users with a conversational AI system that can assist with a wide range of language-based tasks. It is designed to engage in interactive conversations, answer questions, provide explanations, offer suggestions, and facilitate information retrieval. It can also engage in more specialized conversations such as prose writing, programming, script and dialogues, and explaining scientific concepts to varying degrees of complexity.
ChatGPT-3.5 can also be employed in customer service applications, virtual assistants, educational tools, and other systems that require natural language understanding and generation.
Performance
ChatGPT-3.5 has demonstrated strong performance across various language tasks, including understanding and generating text in a conversational context. It is capable of producing coherent and contextually relevant responses to user input, as well as retaining short-term context in order to engage in meaningful dialogue with a user.
The model has been trained on a vast amount of internet text, enabling it to leverage a wide range of knowledge and information.
However, it is important to note that ChatGPT-3.5 may occasionally produce incorrect or nonsensical answers, especially when presented with ambiguous queries or lacking relevant context.
Limitations
While ChatGPT-3.5 exhibits impressive capabilities, it also has certain limitations that users should be aware of:
Lack of Real-Time Information: ChatGPT-3.5’s training data is current until September 2021. Therefore, it may not be aware of recent events or have access to real-time information. Consequently, it may provide outdated or inaccurate responses to queries related to current affairs or time-sensitive topics.
Sensitivity to Input Phrasing: ChatGPT-3.5 is sensitive to slight rephrasing of questions or prompts. While it strives to generate consistent responses, minor changes in phrasing can sometimes lead to different answers or interpretations. Users should be mindful of this when interacting with ChatGPT.
Propensity for Biases: ChatGPT-3.5 is trained on a broad range of internet text, which may include biased or objectionable content. While efforts have been made to mitigate biases during training, the model may still exhibit some biases or respond to sensitive topics inappropriately. It is important to use ChatGPT's responses critically and be aware of potential biases.
Inability to Verify Information: ChatGPT-3.5 does not have the capability to verify the accuracy or truthfulness of the information it generates. It relies solely on patterns in the training data and may occasionally provide incorrect or misleading information. Users are encouraged to independently verify any critical or factual information obtained from the model.
Lack of Context Awareness: Although ChatGPT-3.5 can maintain short-term context within a conversation, it lacks long-term memory. Consequently, it may sometimes provide inconsistent or contradictory answers within the same conversation. Users should ensure they provide sufficient context to minimize potential misunderstandings.
LLM Hallucination: ChatGPT-3.5, much like many other large language models, is prone to a phenomenon called “LLM hallucination”. At its core, GPT-3.5, like other LLMs, is a neural network trained on a large amount of text data. It is a statistical machine that essentially "learns" to predict the next word in a sentence based on the context provided by preceding words. As a result, hallucination occurs because the model's primary objective is to generate text that is coherent and contextually appropriate, rather than factually accurate.
In addition, OpenAI has since released GPT-4, which makes significant improvements on the GPT-3.5 architecture. With more parameters, a larger context window, and the ability to accept both text and image inputs, GPT-4 is a significant improvement over GPT-3.5.
Citation
https://platform.openai.com/docs/models