Models
Claude 3.5 Haiku
Claude 3.5 Haiku, the next generation of Anthropic's fastest and most cost-effective model, is optimal for use cases where speed and affordability matter. It improves on its predecessor across every skill set.
Intended Use
Claude 3.5 Haiku offers fast speeds, improved instruction-following, and accurate tool use, making it ideal for user-facing products and personalized experiences.
Key use cases include code completion, where it streamlines development workflows with quick, accurate suggestions. It powers interactive chatbots for customer service, e-commerce, and education, handling high volumes of user interactions. It excels at data extraction and labeling, processing large datasets in sectors like finance and healthcare. Additionally, it provides real-time content moderation for safe online environments.
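As a rough illustration of the kind of user-facing integration described above, the sketch below sends a code-completion style request to Claude 3.5 Haiku through Anthropic's Python SDK. The model identifier and parameter values are assumptions and should be checked against Anthropic's current model documentation.

```python
# Minimal sketch: code-completion style request to Claude 3.5 Haiku via the
# Anthropic Python SDK. The model name below is an assumption; confirm the
# current identifier in Anthropic's model docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-haiku-20241022",  # assumed model identifier
    max_tokens=256,
    messages=[
        {
            "role": "user",
            "content": "Complete this Python function:\n\ndef is_palindrome(s: str) -> bool:",
        }
    ],
)

print(response.content[0].text)
```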
Performance
![](http://images.ctfassets.net/j20krz61k3rk/75pbhfj3wBI5MWufwkPEbQ/6bd03c7f0532cd46317be0054b15d6c8/_claude2.png)
Limitations
Context: May struggle with maintaining context over extended conversations, leading to inconsistencies in long interactions.
Bias: As it is trained on a large corpus of internet text, it may inadvertently reflect and perpetuate biases present in the training data.
Creativity Boundaries: While capable of creative outputs, it may not always meet specific creative standards or expectations for novel and nuanced content.
Ethical Concerns: Can be used to generate misleading information or offensive content, or be exploited for harmful purposes if not properly moderated.
Comprehension: Might not fully understand or accurately interpret highly technical or domain-specific content, especially if it involves recent developments post-training data cutoff.
Dependence on Prompt Quality: The quality and relevance of the output are highly dependent on the clarity and specificity of the input prompts provided by the user.
Citation
https://docs.anthropic.com/claude/docs/models-overview
BLIP2 (blip2-flan-t5-xxl)
BLIP2 is a visual language model (VLM) that can perform multi-modal tasks such as image captioning and visual question answering. This model is the BLIP-2, Flan-T5-XXL variant.
Intended Use
The model is intended for multi-modal tasks such as image captioning and visual question answering. It can also be used for chat-like conversations by feeding the image and the previous turns of the conversation to the model as a prompt.
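Below is a minimal visual question answering sketch using the Hugging Face Transformers port of this variant (Salesforce/blip2-flan-t5-xxl); the image path, prompt format, and dtype/device settings are illustrative assumptions, and the checkpoint is large enough to require substantial GPU memory.

```python
# Minimal sketch: visual question answering with BLIP-2 (Flan-T5-XXL) via
# Hugging Face Transformers. Dtype/device choices are assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder path
prompt = "Question: What is shown in this image? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```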
Performance
Best performance within the BLIP2 family of models.
Limitations
BLIP2 is fine-tuned on image-text datasets (e.g. LAION) collected from the internet. As a result, the model may generate similarly inappropriate content or replicate biases inherent in the underlying data. Other limitations include:
May struggle with highly abstract or culturally specific visual concepts
Performance can vary based on image quality and complexity
Limited by the training data of its component models (vision encoder and language model)
Cannot generate or edit images (only processes and describes them)
Requires careful prompt engineering for optimal performance in some tasks
Citation
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
Grounding Dino + SAM
Grounding DINO + SAM, or Grounding SAM, combines Grounding DINO, an open-set object detector, with the Segment Anything Model (SAM). This integration enables the detection and segmentation of arbitrary regions from free-form text inputs and opens the door to connecting various vision models by letting users create segmentation masks quickly.
Intended Use
Create segmentation masks using SAM and classify the masks using Grounding DINO. The masks are intended to be used as pre-labels.
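One way to sketch the detect-then-segment flow described above is with the Hugging Face Transformers ports of Grounding DINO and SAM rather than the original Grounded-SAM repository; the checkpoint names, text prompt, thresholds, and image path below are assumptions, and argument names may differ slightly across library versions.

```python
# Minimal sketch: Grounding DINO proposes boxes from a text prompt, then SAM
# turns each box into candidate masks. Checkpoints, prompt, and thresholds are
# assumptions for illustration.
import torch
from PIL import Image
from transformers import (
    AutoProcessor,
    AutoModelForZeroShotObjectDetection,
    SamModel,
    SamProcessor,
)

image = Image.open("rooftop.jpg")  # placeholder path
text = "solar panel. roof."  # Grounding DINO expects lowercase, period-separated phrases

# 1) Open-set detection with Grounding DINO
det_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
det_model = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-base")
det_inputs = det_processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    det_outputs = det_model(**det_inputs)
results = det_processor.post_process_grounded_object_detection(
    det_outputs,
    det_inputs.input_ids,
    box_threshold=0.35,   # argument names may vary across transformers versions
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]

# 2) Promptable segmentation with SAM, using the detected boxes as prompts
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
sam_model = SamModel.from_pretrained("facebook/sam-vit-huge")
sam_inputs = sam_processor(image, input_boxes=[results["boxes"].tolist()], return_tensors="pt")
with torch.no_grad():
    sam_outputs = sam_model(**sam_inputs)
masks = sam_processor.image_processor.post_process_masks(
    sam_outputs.pred_masks, sam_inputs["original_sizes"], sam_inputs["reshaped_input_sizes"]
)
print(f"Detected {len(results['boxes'])} regions; mask tensor shape: {tuple(masks[0].shape)}")
```

SAM typically returns several candidate masks per box; in practice the highest-scoring mask per box is usually kept as the pre-label.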
Limitations
Inaccurate classification might occur, especially for aerial-imagery classes such as roofs and solar panels.
Mask accuracy is suboptimal for complex shapes, low-contrast regions, and small objects.
Citation
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. (2023). Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499.
Chen, J., Yang, Z., & Zhang, L. (2023). Semantic Segment Anything. https://github.com/fudan-zvg/Semantic-Segment-Anything
Google Gemini 1.5 Pro
Google Gemini 1.5 Pro is a large-scale language model trained jointly across image, audio, video, and text data, with the goal of building a model that has both strong generalist capabilities across modalities and cutting-edge understanding and reasoning performance. This model uses Gemini 1.5 Pro on Vertex AI, with enhanced performance, scalability, and deployability.
Intended Use
Gemini 1.5, the next-generation model from Google, offers significant performance enhancements. Built on a Mixture-of-Experts (MoE) architecture, it delivers improved efficiency in training and serving. Gemini 1.5 Pro, the mid-size multimodal model, exhibits comparable quality to the larger 1.0 Ultra and introduces a breakthrough experimental long-context understanding capability. With a context window of up to 1 million tokens, it can process vast amounts of information, including:
1 hour of video
11 hours of audio
over 700,000 words
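Since this model is served through Vertex AI, a long-context multimodal request might look like the following minimal sketch using the Vertex AI Python SDK; the project, location, model identifier, and Cloud Storage URI are placeholder assumptions.

```python
# Minimal sketch: long-context multimodal request to Gemini 1.5 Pro on Vertex AI.
# Project, location, model name, and the GCS URI are placeholder assumptions.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

model = GenerativeModel("gemini-1.5-pro-001")  # assumed model identifier

video = Part.from_uri("gs://my-bucket/lecture.mp4", mime_type="video/mp4")  # placeholder URI
prompt = "Summarize this lecture and list the key topics with approximate timestamps."

response = model.generate_content([video, prompt])
print(response.text)
```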
Use cases
Gemini is good at a wide variety of multimodal use cases, including but not limited to:
Info Seeking: Fusing world knowledge with information extracted from the images and videos.
Object Recognition: Answering questions related to fine-grained identification of the objects in images and videos.
Digital Content Understanding: Answering questions and extracting information from various contents like infographics, charts, figures, tables, and web pages.
Structured Content Generation: Generating responses in formats like HTML and JSON, based on provided prompt instructions.
Captioning / Description: Generating descriptions of images and videos with varying levels of detail. For example, for images, the prompt can be: “Can you write a description about the image?”. For videos, the prompt can be: “Can you write a description about what's happening in this video?”
Extrapolations: Suggesting what else to see based on location, what might happen next/before/between images or videos, and enabling creative uses like writing stories based on visual inputs.
Limitations
Here are some of the limitations we are aware of:
Medical images: Gemini Pro is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
Hallucinations: the model can provide factually inaccurate information.
CAPTCHAS: For safety reasons, Gemini Pro has a system to block the submission of CAPTCHAs.
Multi-turn (multimodal) chat: Not trained for chatbot functionality or answering questions in a chatty tone, and can perform less effectively in multi-turn conversations.
Following complex instructions: Can struggle with tasks requiring multiple reasoning steps. Consider breaking down instructions or providing few-shot examples for better guidance.
Counting: Can only provide rough approximations of object counts, especially for obscured objects.
Spatial reasoning: Can struggle with precise object/text localization in images. It may be less accurate in understanding rotated images.
Citation
https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#sundar-note
Google Gemini Pro
Google Gemini is a large-scale language model trained jointly across image, audio, video, and text data, with the goal of building a model that has both strong generalist capabilities across modalities and cutting-edge understanding and reasoning performance. This model uses Gemini Pro on Vertex AI, with enhanced performance, scalability, and deployability.
Intended Use
Gemini is designed to process and reason across different inputs like text, images, video, and code. On the Labelbox platform, Gemini supports a wide range of image and language tasks such as text generation, question answering, classification, visual understanding, and answering questions about math.
Performance
Gemini is Google’s largest and most capable model to date. It is the first AI model to surpass human experts on the Massive Multitask Language Understanding (MMLU) benchmark, and it reports state-of-the-art (SOTA) performance on multi-modal tasks.
Limitations
There is a continued need for research and development on reducing “hallucinations” generated by LLMs. LLMs also struggle with tasks requiring high-level reasoning abilities such as causal understanding, logical deduction, and counterfactual reasoning.
Citation
Technical report on Gemini: a Family of Highly Capable Multimodal Models
Tesseract OCR
Tesseract was originally developed at Hewlett-Packard Laboratories in Bristol, UK, and at Hewlett-Packard Co. in Greeley, Colorado, USA, between 1985 and 1994, with further changes in 1996 to port it to Windows and some conversion to C++ in 1998. In 2005, Tesseract was open-sourced by HP. From 2006 until November 2018 it was developed by Google.
Intended Use
This model uses Tesseract (https://github.com/tesseract-ocr/tesseract) for OCR, and writes the output as a text annotation.
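A minimal sketch of the OCR step using the pytesseract wrapper is shown below; it assumes the Tesseract binary and the English language pack are installed separately, and the file path is a placeholder.

```python
# Minimal sketch: OCR an image to plain text with pytesseract (a thin wrapper
# around the Tesseract binary, which must be installed separately).
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")  # placeholder path
text = pytesseract.image_to_string(image, lang="eng")  # assumes the English language pack
print(text)
```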
Performance
Tesseract works best with straight, well-scanned text. For text in the wild, handwriting, and other use cases, other models should be used.
Citation
https://github.com/tesseract-ocr/tesseract
Google Imagen
Google's Imagen generates a relevant description or caption for a given image.
Intended Use
The model performs image captioning, allowing users to generate a relevant description for an image. You can use this capability for a variety of use cases:
Creators can generate captions for uploaded images
Generate captions to describe products
Integrate Imagen captioning with an app using the API to create new experiences
Imagen currently supports five languages: English, German, French, Spanish and Italian.
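A captioning call through the Vertex AI Python SDK might look like the following minimal sketch; the ImageTextModel class, the "imagetext@001" identifier, and the file path are assumptions based on Google's published documentation and may change.

```python
# Minimal sketch: image captioning with the Vertex AI image-text model.
# The "imagetext@001" identifier and parameters are assumptions; check
# Google's current Imagen captioning documentation.
import vertexai
from vertexai.vision_models import Image, ImageTextModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

model = ImageTextModel.from_pretrained("imagetext@001")
image = Image.load_from_file("product_photo.jpg")  # placeholder path

captions = model.get_captions(image=image, number_of_results=2, language="en")
for caption in captions:
    print(caption)
```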
Performance
The Imagen model is reported to achieve high accuracy; however, it may have limitations in generating captions for complex or abstract images. The model may also generate captions that reflect biases present in the training data.
Citations
Google image captioning documentation
Google visual question answering documentation
Grounding DINO
An open-set object detector that combines the Transformer-based detector DINO with grounded pre-training. It can detect arbitrary objects from human inputs such as category names or referring expressions.
Intended Use
Useful for zero-shot object detection tasks.
Performance
Grounding DINO performs remarkably well across closed-set, open-set, and referring object detection settings, with benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. It achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO, and sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP.
Citation
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. (2023). Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499.
OpenAI GPT4
ChatGPT is an advanced conversational artificial intelligence language model developed by OpenAI. It is based on the GPT-4 architecture and has been trained on a diverse range of internet text to generate human-like responses in natural language conversations. This model is the latest version.
Intended Use
GPT stands for Generative Pre-trained Transformer, a type of language model that uses deep learning to generate human-like, conversational text. As a multimodal model, GPT-4 is able to accept both text and image inputs.
However, OpenAI has not yet made the GPT-4 model's visual input capabilities available through any platform. Currently the only way to access the text-input capability through OpenAI is with a subscription to ChatGPT Plus.
The GPT-4 model is optimized for conversational interfaces and can be used to generate text summaries, reports, and responses. Currently, only the text modality is supported.
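As a minimal sketch of the text-only conversational usage described above, the example below calls GPT-4 through the OpenAI Python SDK (v1+); the model name and parameter values are assumptions.

```python
# Minimal sketch: conversational text generation with GPT-4 via the OpenAI
# Python SDK (v1+). Model name and parameters are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the following report in three bullet points: ..."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
```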
Performance
GPT-4 is a highly advanced model that can accept both image and text inputs, making it more versatile than its predecessor, GPT-3. However, it is important to use the appropriate techniques to get the best results, as the model behaves differently than older GPT models.
OpenAI published results for the GPT-4 model comparing it to other state-of-the-art (SOTA) models, including its predecessor, GPT-3.5.
| Benchmark | GPT-4 (evaluated few-shot) | GPT-3.5 (evaluated few-shot) | LM SOTA (best external LM, evaluated few-shot) | SOTA (best external model, includes benchmark-specific training) |
| --- | --- | --- | --- | --- |
| MMLU: multiple-choice questions in 57 subjects (professional & academic) | 86.4% (5-shot) | 70.0% (5-shot) | 70.7% (5-shot, U-PaLM) | 75.2% (5-shot, Flan-PaLM) |
| HellaSwag: commonsense reasoning around everyday events | 95.3% (10-shot) | 85.5% (10-shot) | 84.2% (LLaMA, validation set) | 85.6% (ALUM) |
| AI2 Reasoning Challenge (ARC): grade-school multiple-choice science questions, challenge set | 96.3% (25-shot) | 85.2% (25-shot) | 84.2% (8-shot, PaLM) | 85.6% (ST-MoE) |
| WinoGrande: commonsense reasoning around pronoun resolution | 87.5% (5-shot) | 81.6% (5-shot) | 84.2% (5-shot, PaLM) | 85.6% (5-shot, PaLM) |
| HumanEval: Python coding tasks | 67.0% (0-shot) | 48.1% (0-shot) | 26.2% (0-shot, PaLM) | 65.8% (CodeT + GPT-3.5) |
| DROP (F1 score): reading comprehension & arithmetic | 80.9 (3-shot) | 64.1 (3-shot) | 70.8 (1-shot, PaLM) | 88.4 (QDGAT) |
Limitations
The underlying behavior of the GPT-4 model is more likely to change over time, and it may provide less useful responses if interacted with in the same way as older models. The GPT-4 model has limitations similar to previous GPT models, such as being prone to LLM hallucinations and reasoning errors, although OpenAI claims that GPT-4 hallucinates less often than other models.
Citation
https://openai.com/research/gpt-4
CLIP ViT LAION (Classification)
Zero-shot image classification. The model produces exactly one prediction per classification task unless the maximum classification score is below the confidence threshold.
Intended Use
Intended for zero-shot image classification, as described above; predictions are gated by the confidence threshold.
This is a CLIP ViT-bigG/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).
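As a rough illustration of the zero-shot classification behavior described above, the sketch below scores an image against a small set of candidate labels with OpenCLIP; the pretrained tag, candidate labels, image path, and threshold value are assumptions for illustration.

```python
# Minimal sketch: zero-shot image classification with OpenCLIP.
# The pretrained tag, labels, and confidence threshold are assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"  # assumed OpenCLIP pretrained tag
)
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # example labels
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

confidence_threshold = 0.5  # assumed threshold, mirroring the behavior described above
score, idx = probs[0].max(dim=0)
print(labels[idx] if score >= confidence_threshold else "below confidence threshold")
```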
OpenAI GPT-3.5 Turbo
ChatGPT is an advanced conversational artificial intelligence language model developed by OpenAI. It is based on the GPT-3.5 architecture and has been trained on a diverse range of internet text to generate human-like responses in natural language conversations. This model is the latest version.
Intended Use
The primary intended use of ChatGPT-3.5 is to provide users with a conversational AI system that can assist with a wide range of language-based tasks. It is designed to engage in interactive conversations, answer questions, provide explanations, offer suggestions, and facilitate information retrieval. It can also engage in more specialized tasks such as prose writing, programming, scripts and dialogue, and explaining scientific concepts at varying degrees of complexity.
ChatGPT-3.5 can also be employed in customer service applications, virtual assistants, educational tools, and other systems that require natural language understanding and generation.
Performance
ChatGPT-3.5 has demonstrated strong performance across various language tasks, including understanding and generating text in a conversational context. It is capable of producing coherent and contextually relevant responses to user input as well as storing information short-term to offer meaningful information and engage in meaningful dialogue with a user.
The model has been trained on a vast amount of internet text, enabling it to leverage a wide range of knowledge and information.
However, it is important to note that ChatGPT-3.5 may occasionally produce incorrect or nonsensical answers, especially when presented with ambiguous queries or lacking relevant context.
Limitations
While ChatGPT-3.5 exhibits impressive capabilities, it also has certain limitations that users should be aware of:
Lack of Real-Time Information: ChatGPT-3.5’s training data is current until September 2021. Therefore, it may not be aware of recent events or have access to real-time information. Consequently, it may provide outdated or inaccurate responses to queries related to current affairs or time-sensitive topics.
Sensitivity to Input Phrasing: ChatGPT-3.5 is sensitive to slight rephrasing of questions or prompts. While it strives to generate consistent responses, minor changes in phrasing can sometimes lead to different answers or interpretations. Users should be mindful of this when interacting with ChatGPT.
Propensity for Biases: ChatGPT-3.5 is trained on a broad range of internet text, which may include biased or objectionable content. While efforts have been made to mitigate biases during training, the model may still exhibit some biases or respond to sensitive topics inappropriately. It is important to use ChatGPT's responses critically and be aware of potential biases.
Inability to Verify Information: ChatGPT-3.5 does not have the capability to verify the accuracy or truthfulness of the information it generates. It relies solely on patterns in the training data and may occasionally provide incorrect or misleading information. Users are encouraged to independently verify any critical or factual information obtained from the model.
Lack of Context Awareness: Although ChatGPT-3.5 can maintain short-term context within a conversation, it lacks long-term memory. Consequently, it may sometimes provide inconsistent or contradictory answers within the same conversation. Users should ensure they provide sufficient context to minimize potential misunderstandings.
LLM Hallucination: ChatGPT-3.5, much like many other large language models, is prone to a phenomenon called “LLM hallucination”. At its core, GPT-3.5, like other LLMs, is a neural network trained on a large amount of text data. It is a statistical machine that essentially "learns" to predict the next word in a sentence based on the context provided by preceding words. As a result, hallucination occurs because the model's primary objective is to generate text that is coherent and contextually appropriate, rather than factually accurate.
In addition, OpenAI has since released GPT-4, which makes significant improvements on the GPT-3.5 architecture. From parameter count to memory, and now accepting both text and image inputs, GPT-4 is a significant improvement over GPT-3.5.
Citation
https://platform.openai.com/docs/models