Models
Google Gemini 2.0 Pro
Gemini 2.0 Pro is the strongest model in the Gemini family for coding and world knowledge, and it features a 2-million-token context window. Gemini 2.0 Pro is available as an experimental model in Vertex AI and is an upgrade path for 1.5 Pro users who want better quality, or who are particularly invested in long context and code.
Intended Use
Multimodal input
Text output
Prompt optimizers
Controlled generation
Function calling (excluding compositional function calling)
Grounding with Google Search
Code execution
Token counting
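Most of these features are exposed through the Gen AI SDK on Vertex AI. Below is a minimal sketch of controlled generation and token counting, assuming the google-genai Python package; the project, location, and experimental model ID are placeholders and should be replaced with current values.

```python
from google import genai
from google.genai import types

# Vertex AI client; project and location are placeholders.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

MODEL_ID = "gemini-2.0-pro-exp-02-05"  # experimental model ID; check the current name

# Controlled generation: constrain the output to a JSON schema.
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Extract the product name and price from: 'The Acme X200 costs $499.'",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "product": {"type": "string"},
                "price": {"type": "number"},
            },
            "required": ["product", "price"],
        },
    ),
)
print(response.text)

# Token counting for the same prompt.
tokens = client.models.count_tokens(model=MODEL_ID, contents="The Acme X200 costs $499.")
print(tokens.total_tokens)
```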
Performance
Google Gemini 2.0 Pro has the strongest coding performance and ability to handle complex prompts of any Gemini model released so far, with better understanding of and reasoning about world knowledge.
It features the largest context window of 2 million tokens, enabling comprehensive analysis and understanding of large amounts of information. Additionally, it can call external tools like Google Search and execute code, enhancing its utility for a wide range of tasks, including coding and knowledge analysis.

Limitations
Context: Gemini 2.0 Pro may struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As Gemini 2.0 Pro is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, Gemini 2.0 Pro may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Gemini 2.0 Pro can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Gemini 2.0 Pro might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of Gemini 2.0 Pro's output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2#2.0-pro
Google Gemini 2.0 Flash
Gemini 2.0 Flash is designed to handle high-volume, high-frequency tasks at scale and is highly capable of multimodal reasoning across vast amounts of information with a context window of 1 million tokens.
Intended Use
Text generation
Grounding with Google Search
Gen AI SDK
Multimodal Live API
Bounding box detection
Image generation
Speech generation
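A minimal sketch of Grounding with Google Search through the Gen AI SDK on Vertex AI, assuming the google-genai Python package; the project, location, model ID, and prompt below are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Ground the response in Google Search results.
response = client.models.generate_content(
    model="gemini-2.0-flash-001",  # check the current model ID
    contents="Who won the most recent FIFA World Cup?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)

# Grounding metadata (web sources) is attached to the candidate.
print(response.candidates[0].grounding_metadata)
```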
Performance
Gemini 2.0 Flash outperforms its predecessor, Gemini 1.5 Pro, on key benchmarks, at twice the speed. It also features the following improvements:
Multimodal Live API: This new API enables low-latency bidirectional voice and video interactions with Gemini.
Quality: Enhanced performance across most quality benchmarks compared to Gemini 1.5 Pro.
Improved agentic capabilities: 2.0 Flash delivers improvements to multimodal understanding, coding, complex instruction following, and function calling. These improvements work together to support better agentic experiences.

Limitations
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2#2.0-flash
Claude 3.7 Sonnet
Claude 3.7 Sonnet, by Anthropic, can produce near-instant responses or extended, step-by-step thinking. Claude 3.7 Sonnet shows particularly strong improvements in coding and front-end web development.
Intended Use
Claude 3.7 Sonnet is designed to enhance real-world tasks by offering a blend of fast responses and deep reasoning, particularly in coding, web development, problem-solving, and instruction-following.
Optimized for real-world applications rather than competitive math or computer science problems.
Useful in business environments requiring a balance of speed and accuracy.
Ideal for tasks like bug fixing, feature development, and large-scale refactoring.
Coding Capabilities:
Strong in handling complex codebases, planning code changes, and full-stack updates.
Introduces Claude Code, an agentic coding tool that can edit files, write and run tests, and commit and push code to repositories such as GitHub.
Claude Code significantly reduces development time by automating tasks that would typically take 45+ minutes manually.
Performance
Claude 3.7 Sonnet combines the capabilities of a large language model (LLM) with advanced reasoning, allowing users to choose between standard mode for quick responses and extended thinking mode for deeper reflection before answering. In extended thinking mode, Claude self-reflects, improving performance in tasks like math, physics, coding, and instruction following. Users can also control the thinking time via the API, adjusting the token budget to balance speed and answer quality.
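A minimal sketch of this thinking-budget control using the Anthropic Messages API with the anthropic Python package; the model ID, token budget, and prompt below are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # check the current model ID
    max_tokens=4096,
    # Extended thinking: allocate a token budget for internal reasoning
    # before the final answer. Larger budgets trade latency for quality.
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[
        {"role": "user", "content": "A train travels 180 km in 2.5 hours. What is its average speed?"}
    ],
)

# The response interleaves thinking blocks and text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```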

Early testing demonstrated Claude’s superiority in coding, with significant improvements in handling complex codebases, advanced tool usage, and planning code changes. It also excels at full-stack updates and producing production-ready code with high precision, as seen in use cases with platforms like Vercel, Replit, and Canva. Claude's performance is particularly strong in developing sophisticated web apps, dashboards, and reducing errors. This makes it a top choice for developers working on real-world coding tasks.
Limitations
Context: Claude 3.7 Sonnet may struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As Claude 3.7 Sonnet is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, Claude 3.7 Sonnet may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Claude 3.7 Sonnet can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Claude 3.7 Sonnet might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output of Claude 3.7 Sonnet are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://www.anthropic.com/news/claude-3-7-sonnet
Claude 3.5 Haiku
Claude 3.5 Haiku, the next generation of Anthropic's fastest and most cost-effective model, is optimal for use cases where speed and affordability matter. It improves on its predecessor across every skill set.
Intended Use
Claude 3.5 Haiku offers fast speeds, improved instruction-following, and accurate tool use, making it ideal for user-facing products and personalized experiences.
Key use cases include code completion, where it streamlines development workflows with quick, accurate code suggestions. It powers interactive chatbots for customer service, e-commerce, and education, handling high volumes of user interactions. It excels at data extraction and labeling, processing large datasets in sectors like finance and healthcare. Additionally, it provides real-time content moderation for safe online environments.
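A minimal data extraction sketch using the Anthropic Messages API with the anthropic Python package; the model ID, system prompt, and input text below are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Structured data extraction from free text, a typical Haiku workload.
note = "Patient reports mild headache since Tuesday; prescribed ibuprofen 400 mg twice daily."

response = client.messages.create(
    model="claude-3-5-haiku-20241022",  # check the current model ID
    max_tokens=300,
    system="Extract symptom, onset, medication, and dosage as a JSON object. Reply with JSON only.",
    messages=[{"role": "user", "content": note}],
)
print(response.content[0].text)
```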
Performance

Limitations
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information, offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://docs.anthropic.com/claude/docs/models-overview
BLIP2 (blip2-flan-t5-xxl)
BLIP2 is a visual language model (VLM) that can perform multi-modal tasks such as image captioning and visual question answering. This model is the BLIP-2, Flan-T5-XXL variant.
Intended Use
In addition to image captioning and visual question answering, BLIP2 can be used for chat-like conversations by feeding the image and the previous conversation as a prompt to the model.
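A minimal captioning and visual question answering sketch using the Hugging Face transformers library, assuming the Salesforce/blip2-flan-t5-xxl checkpoint, a GPU with sufficient memory, and the accelerate package for device_map="auto"; the image path and question are placeholders.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# blip2-flan-t5-xxl is large; half precision and automatic device placement are assumed.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")

# Image captioning: no text prompt, the model describes the image.
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0])

# Visual question answering: pass a question alongside the image.
inputs = processor(
    images=image,
    text="Question: how many people are in the photo? Answer:",
    return_tensors="pt",
).to(model.device, torch.float16)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```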
Performance
Best performance within the BLIP2 family of models.
Limitations
BLIP2 is fine-tuned on image-text datasets (e.g., LAION) collected from the internet. As a result, the model is potentially vulnerable to generating inappropriate content or replicating inherent biases present in the underlying data. Other limitations include:
May struggle with highly abstract or culturally specific visual concepts
Performance can vary based on image quality and complexity
Limited by the training data of its component models (vision encoder and language model)
Cannot generate or edit images (only processes and describes them)
Requires careful prompt engineering for optimal performance in some tasks
Citation
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
Grounding DINO + SAM
Grounding DINO + SAM, or Grounding SAM, uses Grounding DINO as an open-set object detector combined with the Segment Anything Model (SAM). This integration enables the detection and segmentation of arbitrary regions based on free-text inputs and opens a door to connecting various vision models by enabling users to create segmentation masks quickly.
Intended Use
Create segmentation masks using SAM and classify the masks using Grounding DINO. The masks are intended to be used as pre-labels.
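A minimal sketch of this two-stage pipeline using the Hugging Face transformers ports of Grounding DINO and SAM; the checkpoints, thresholds, class names, and image path below are illustrative rather than the exact configuration used on the platform.

```python
import torch
from PIL import Image
from transformers import (AutoProcessor, AutoModelForZeroShotObjectDetection,
                          SamModel, SamProcessor)

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("aerial_tile.jpg").convert("RGB")

# 1) Grounding DINO: detect boxes for the text classes of interest.
dino_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
dino = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base").to(device)
text = "a roof. a solar panel."  # classes are lowercased and period-separated
inputs = dino_proc(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = dino(**inputs)
detections = dino_proc.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]])[0]
if len(detections["boxes"]) == 0:
    raise SystemExit("No boxes detected for the given text prompt.")

# 2) SAM: turn each detected box into a segmentation mask (the pre-label).
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-huge")
sam = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
boxes = [[box.tolist() for box in detections["boxes"]]]
sam_inputs = sam_proc(image, input_boxes=boxes, return_tensors="pt").to(device)
with torch.no_grad():
    sam_outputs = sam(**sam_inputs)
masks = sam_proc.image_processor.post_process_masks(
    sam_outputs.pred_masks.cpu(), sam_inputs["original_sizes"].cpu(),
    sam_inputs["reshaped_input_sizes"].cpu())

# Each mask is paired with the text label and detection score from Grounding DINO.
for label, score in zip(detections["labels"], detections["scores"]):
    print(label, round(score.item(), 3))
```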
Limitations
Inaccurate classification might occur, especially for aerial imagery classes such as roofs and solar panels.
The accuracy of masks is suboptimal in areas with complex shapes, low-contrast zones, and small objects.
Citation
Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others. (2023). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499
Chen, Jiaqi and Yang, Zeyu and Zhang, Li. (2023). Semantic Segment Anything. https://github.com/fudan-zvg/Semantic-Segment-Anything
Google Gemini 1.5 Pro
Google Gemini 1.5 Pro is a large-scale language model trained jointly across image, audio, video, and text data, with the goal of building a model with strong generalist capabilities across modalities alongside cutting-edge understanding and reasoning performance. This model uses Gemini 1.5 Pro on Vertex AI, with enhanced performance, scalability, and deployability.
Intended Use
Gemini 1.5, the next-generation model from Google, offers significant performance enhancements. Built on a Mixture-of-Experts (MoE) architecture, it delivers improved efficiency in training and serving. Gemini 1.5 Pro, the mid-size multimodal model, exhibits comparable quality to the larger 1.0 Ultra, featuring a breakthrough experimental long-context understanding feature. With a context window of up to 1 million tokens, it can process vast amounts of information, including:
1 hour of video
11 hours of audio
over 700,000 words
Use cases
Gemini is good at a wide variety of multimodal use cases, including but not limited to:
Info Seeking: Fusing world knowledge with information extracted from the images and videos.
Object Recognition: Answering questions related to fine-grained identification of the objects in images and videos.
Digital Content Understanding: Answering questions and extracting information from various contents like infographics, charts, figures, tables, and web pages.
Structured Content Generation: Generating responses in formats like HTML and JSON, based on provided prompt instructions.
Captioning / Description: Generating descriptions of images and videos with varying levels of detail. For example, for images, the prompt can be: “Can you write a description about the image?”. For videos, the prompt can be: “Can you write a description about what's happening in this video?” (A minimal captioning sketch follows this list.)
Extrapolations: Suggesting what else to see based on location, what might happen next/before/between images or videos, and enabling creative uses like writing stories based on visual inputs.
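A minimal captioning and structured-generation sketch using the Vertex AI Python SDK; the project, location, Cloud Storage URI, and model ID below are placeholders.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-002")  # check the current model ID

# Captioning / description of a video stored in Cloud Storage.
video = Part.from_uri("gs://my-bucket/clip.mp4", mime_type="video/mp4")
response = model.generate_content(
    [video, "Can you write a description about what's happening in this video?"]
)
print(response.text)

# Structured content generation: ask for JSON and parse it downstream.
response = model.generate_content(
    [video, "List the objects visible in the video as a JSON array of strings."],
    generation_config={"response_mime_type": "application/json"},
)
print(response.text)
```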
Limitations
Here are some of the limitations we are aware of:
Medical images: Gemini Pro is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
Hallucinations: the model can provide factually inaccurate information.
Counting: Gemini Pro may give only approximate counts for objects in images, especially when objects are obscured.
CAPTCHAS: For safety reasons, Gemini Pro has a system to block the submission of CAPTCHAs.
Multi-turn (multimodal) chat: Not trained for chatbot functionality or answering questions in a chatty tone, and can perform less effectively in multi-turn conversations.
Following complex instructions: Can struggle with tasks requiring multiple reasoning steps. Consider breaking down instructions or providing few-shot examples for better guidance.
Spatial reasoning: Can struggle with precise object/text localization in images. It may be less accurate in understanding rotated images.
Citation
https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#sundar-note
Google Gemini Pro
Google Gemini is a large-scale language model trained jointly across image, audio, video, and text data, with the goal of building a model with strong generalist capabilities across modalities alongside cutting-edge understanding and reasoning performance. This model uses Gemini Pro on Vertex AI, with enhanced performance, scalability, and deployability.
Intended Use
Gemini is designed to process and reason across different inputs like text, images, video, and code. On the Labelbox platform, Gemini supports a wide range of image and language tasks such as text generation, question answering, classification, visual understanding, and answering questions about math.
Performance
Gemini is Google's largest and most capable model to date. It is the first AI model to surpass human experts on the Massive Multitask Language Understanding (MMLU) benchmark, and it reports state-of-the-art (SOTA) performance on multi-modal tasks.
Limitations
There is a continued need for research and development on reducing “hallucinations” generated by LLMs. LLMs also struggle with tasks requiring high-level reasoning abilities such as causal understanding, logical deduction, and counterfactual reasoning.
Citation
Technical report on Gemini: a Family of Highly Capable Multimodal Models
Tesseract OCR
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol, UK and at Hewlett-Packard Co., Greeley, Colorado, USA between 1985 and 1994, with further changes in 1996 to port it to Windows and a migration toward C++ in 1998. In 2005 Tesseract was open sourced by HP, and from 2006 until November 2018 it was developed by Google.
Intended Use
This model uses Tesseract (https://github.com/tesseract-ocr/tesseract) for OCR, and writes the output as a text annotation.
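A minimal sketch of the underlying engine via the pytesseract wrapper (illustrating Tesseract itself, not the platform integration); the image path is a placeholder and the tesseract binary must be installed separately.

```python
from PIL import Image
import pytesseract  # Python wrapper; requires a separately installed tesseract binary

# Plain text extraction from a scanned page.
text = pytesseract.image_to_string(Image.open("scanned_page.png"), lang="eng")
print(text)

# Word-level boxes and confidences, useful for building text annotations.
data = pytesseract.image_to_data(Image.open("scanned_page.png"),
                                 output_type=pytesseract.Output.DICT)
for word, conf, x, y, w, h in zip(data["text"], data["conf"], data["left"],
                                  data["top"], data["width"], data["height"]):
    if word.strip():
        print(word, conf, (x, y, w, h))
```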
Performance
Tesseract works best with straight, well-scanned text. For text in the wild, handwriting, and other challenging use cases, other models should be used.
Citation
https://github.com/tesseract-ocr/tesseract
Google Imagen
Google's Imagen generates a relevant description or caption for a given image.
Intended Use
The model performs image captioning, allowing users to generate a relevant description for an image. You can use this information for a variety of use cases:
Creators can generate captions for uploaded images
Generate captions to describe products
Integrate Imagen captioning with an app using the API to create new experiences (see the sketch below)
Imagen currently supports five languages: English, German, French, Spanish and Italian.
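A minimal captioning sketch using the Vertex AI Python SDK's ImageTextModel; the project, location, image path, and model version below are placeholders and may differ from the current API surface.

```python
import vertexai
from vertexai.vision_models import Image, ImageTextModel

vertexai.init(project="my-project", location="us-central1")
model = ImageTextModel.from_pretrained("imagetext@001")  # check the current model version

image = Image.load_from_file("product_photo.jpg")

# Generate up to three candidate captions in English.
captions = model.get_captions(image=image, number_of_results=3, language="en")
for caption in captions:
    print(caption)
```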
Performance
The Imagen model is reported to achieve high accuracy; however, it may have limitations in generating captions for complex or abstract images. The model may also generate captions that reflect biases present in the training data.
Citations
Google Image captioning documentation
Google Visual Question Answering documentation
Grounding DINO
An open-set object detector that combines the Transformer-based detector DINO with grounded pre-training. It can detect arbitrary objects given human inputs such as category names or referring expressions.
Intended Use
Useful for zero-shot object detection tasks.
Performance
Grounding DINO performs remarkably well across all three evaluation settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. It achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO, and sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP.
Citation
Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others. (2023). Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499
OpenAI GPT4
ChatGPT is an advanced conversational artificial intelligence language model developed by OpenAI. It is based on the GPT-4 architecture and has been trained on a diverse range of internet text to generate human-like responses in natural language conversations. This is the latest version of the model.
Intended Use
GPT stands for Generative Pre-trained Transformer, a type of language model that uses deep learning to generate human-like, conversational text. As a multimodal model, GPT-4 is able to accept both text and image inputs.
However, OpenAI has not yet made the GPT-4 model's visual input capabilities available through any platform. Currently the only way to access the text-input capability through OpenAI is with a subscription to ChatGPT Plus.
The GPT-4 model is optimized for conversational interfaces and can be used to generate text summaries, reports, and responses. Currently, only text modality is supported.
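A minimal text-only sketch using the OpenAI Python SDK; the prompt and sampling parameters below are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the trade-offs between REST and gRPC in three bullet points."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```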
Performance
GPT-4 is a highly advanced model that can accept both image and text inputs, making it more versatile than its predecessor, GPT-3.5. However, it is important to use the appropriate techniques to get the best results, as the model behaves differently than older GPT models.
OpenAI published results for the GPT-4 model comparing it to other state-of-the-art (SOTA) models, including its previous GPT-3.5 model.
| Benchmark | GPT-4 (evaluated few-shot) | GPT-3.5 (evaluated few-shot) | LM SOTA (best external LM, evaluated few-shot) | SOTA (best external model, incl. benchmark-specific training) |
|---|---|---|---|---|
| MMLU — multiple-choice questions in 57 subjects (professional & academic) | 86.4% (5-shot) | 70.0% (5-shot) | 70.7% (5-shot, U-PaLM) | 75.2% (5-shot, Flan-PaLM) |
| HellaSwag — commonsense reasoning around everyday events | 95.3% (10-shot) | 85.5% (10-shot) | 84.2% (LLaMA, validation set) | 85.6% (ALUM) |
| AI2 Reasoning Challenge (ARC) — grade-school multiple-choice science questions, challenge set | 96.3% (25-shot) | 85.2% (25-shot) | 85.2% (8-shot, PaLM) | 86.5% (ST-MOE) |
| WinoGrande — commonsense reasoning around pronoun resolution | 87.5% (5-shot) | 81.6% (5-shot) | 85.1% (5-shot, PaLM) | 85.1% (5-shot, PaLM) |
| HumanEval — Python coding tasks | 67.0% (0-shot) | 48.1% (0-shot) | 26.2% (0-shot, PaLM) | 65.8% (CodeT + GPT-3.5) |
| DROP (F1 score) — reading comprehension & arithmetic | 80.9 (3-shot) | 64.1 (3-shot) | 70.8 (1-shot, PaLM) | 88.4 (QDGAT) |
Limitations
The underlying format of the GPT-4 model is more likely to change over time, and it may provide less useful responses if interacted with in the same way as older models. The GPT-4 model has limitations similar to previous GPT models, such as being prone to LLM hallucination and reasoning errors. Nonetheless, OpenAI reports that GPT-4 hallucinates less often than previous models.
Citation
https://openai.com/research/gpt-4
CLIP ViT LAION (Classification)
Zero-shot image classification. The model produces exactly one prediction per classification task unless the maximum classification score is below the confidence threshold.
Intended Use
Zero-shot image classification using free-text class names; the model returns exactly one prediction per classification task unless the maximum classification score falls below the confidence threshold.
This is a CLIP ViT-bigG/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).
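A minimal zero-shot classification sketch with OpenCLIP that reproduces the thresholding behavior described above; the pretrained tag, class names, image path, and threshold value are illustrative and may differ from the platform configuration.

```python
import torch
import open_clip
from PIL import Image

# ViT-bigG-14 trained on LAION-2B; the pretrained tag below is an OpenCLIP
# release name and may differ from the exact weights used on the platform.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k")
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
model.eval()

class_names = ["cat", "dog", "bird"]
confidence_threshold = 0.5  # illustrative value

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer([f"a photo of a {c}" for c in class_names])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

# Exactly one prediction per task, unless the best score is below the threshold.
best = probs.argmax().item()
if probs[best] >= confidence_threshold:
    print(class_names[best], round(probs[best].item(), 3))
else:
    print("no prediction above threshold")
```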
OpenAI GPT-3.5 Turbo
ChatGPT is an advanced conversational artificial intelligence language model developed by OpenAI. It is based on the GPT-3.5 architecture and has been trained on a diverse range of internet text to generate human-like responses in natural language conversations. This is the latest version of the GPT-3.5 Turbo model.
Intended Use
The primary intended use of ChatGPT-3.5 is to provide users with a conversational AI system that can assist with a wide range of language-based tasks. It is designed to engage in interactive conversations, answer questions, provide explanations, offer suggestions, and facilitate information retrieval. It can also engage in more specialized conversations such as prose writing, programming, script and dialogues, and explaining scientific concepts to varying degrees of complexity.
ChatGPT-3.5 can also be employed in customer service applications, virtual assistants, educational tools, and other systems that require natural language understanding and generation.
Performance
ChatGPT-3.5 has demonstrated strong performance across various language tasks, including understanding and generating text in a conversational context. It is capable of producing coherent and contextually relevant responses to user input, as well as retaining short-term context in order to engage in meaningful dialogue with a user.
The model has been trained on a vast amount of internet text, enabling it to leverage a wide range of knowledge and information.
However, it is important to note that ChatGPT-3.5 may occasionally produce incorrect or nonsensical answers, especially when presented with ambiguous queries or lacking relevant context.
Limitations
While ChatGPT-3.5 exhibits impressive capabilities, it also has certain limitations that users should be aware of:
Lack of Real-Time Information: ChatGPT-3.5’s training data is current until September 2021. Therefore, it may not be aware of recent events or have access to real-time information. Consequently, it may provide outdated or inaccurate responses to queries related to current affairs or time-sensitive topics.
Sensitivity to Input Phrasing: ChatGPT-3.5 is sensitive to slight rephrasing of questions or prompts. While it strives to generate consistent responses, minor changes in phrasing can sometimes lead to different answers or interpretations. Users should be mindful of this when interacting with ChatGPT.
Propensity for Biases: ChatGPT-3.5 is trained on a broad range of internet text, which may include biased or objectionable content. While efforts have been made to mitigate biases during training, the model may still exhibit some biases or respond to sensitive topics inappropriately. It is important to use ChatGPT's responses critically and be aware of potential biases.
Inability to Verify Information: ChatGPT-3.5 does not have the capability to verify the accuracy or truthfulness of the information it generates. It relies solely on patterns in the training data and may occasionally provide incorrect or misleading information. Users are encouraged to independently verify any critical or factual information obtained from the model.
Lack of Context Awareness: Although ChatGPT-3.5 can maintain short-term context within a conversation, it lacks long-term memory. Consequently, it may sometimes provide inconsistent or contradictory answers within the same conversation. Users should ensure they provide sufficient context to minimize potential misunderstandings.
LLM Hallucination: ChatGPT-3.5, much like many other large language models, is prone to a phenomenon called “LLM hallucination”. At its core, GPT-3.5, like other LLMs, is a neural network trained on a large amount of text data. It is a statistical machine that essentially "learns" to predict the next word in a sentence based on the context provided by preceding words. As a result, hallucination occurs because the model's primary objective is to generate text that is coherent and contextually appropriate, rather than factually accurate.
In addition, OpenAI has since released GPT-4, which makes significant improvements on the GPT-3.5 architecture. With more parameters, a larger context window, and the ability to accept both text and image inputs, GPT-4 is a significant improvement over GPT-3.5.
Citation
https://platform.openai.com/docs/models