Google Gemini 1.5 Pro

Text generation
Question answering
Zero-shot classification
Text classification

Google Gemini 1.5 Pro is a large-scale language model trained jointly across image, audio, video, and text data, with the aim of combining strong generalist capabilities across modalities with cutting-edge understanding and reasoning performance. This model uses Gemini Pro on Vertex AI, with enhanced performance, scalability, and deployability.

Intended Use

Gemini 1.5, the next-generation model from Google, offers significant performance enhancements. Built on a Mixture-of-Experts (MoE) architecture, it delivers improved efficiency in training and serving. Gemini 1.5 Pro, the mid-size multimodal model, performs at a level comparable to the larger 1.0 Ultra and introduces a breakthrough experimental feature in long-context understanding. With a context window of up to 1 million tokens, it can process vast amounts of information, including:

  • 1 hour of video 

  • 11 hours of audio 

  • over 700,000 words
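
As a rough planning aid, the figures above can be turned into a back-of-the-envelope capacity check. The sketch below assumes that each listed figure corresponds to the full 1 million token window and that usage scales linearly across modalities; neither assumption comes from an official tokenizer, and the names `CAPACITY`, `window_fraction`, and `fits_in_window` are hypothetical.

```python
# Rough capacity check derived from the figures listed above.
# Assumption: each figure fills the whole window and usage scales linearly.
CONTEXT_WINDOW_TOKENS = 1_000_000

# Approximate per-modality limits, copied from the list above.
# "over 700,000 words" is treated as exactly 700,000, which is conservative.
CAPACITY = {
    "video_hours": 1.0,
    "audio_hours": 11.0,
    "words": 700_000.0,
}


def window_fraction(video_hours=0.0, audio_hours=0.0, words=0.0):
    """Return the estimated fraction of the context window a mixed input uses."""
    return (
        video_hours / CAPACITY["video_hours"]
        + audio_hours / CAPACITY["audio_hours"]
        + words / CAPACITY["words"]
    )


def fits_in_window(video_hours=0.0, audio_hours=0.0, words=0.0):
    """Return True if the estimated usage does not exceed the window."""
    return window_fraction(video_hours, audio_hours, words) <= 1.0


if __name__ == "__main__":
    usage = window_fraction(video_hours=0.5, words=140_000)
    print(f"Estimated usage: {usage:.0%} of {CONTEXT_WINDOW_TOKENS:,} tokens")
    print("Fits:", fits_in_window(video_hours=0.5, words=140_000))
```

An estimate like this is only useful for deciding whether an input is obviously too large; actual token counts depend on the content and should be checked against the service itself.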

Use cases

Gemini is good at a wide variety of multimodal use cases, including but not limited to:

  • Info Seeking: Fusing world knowledge with information extracted from images and videos.

  • Object Recognition: Answering questions related to fine-grained identification of the objects in images and videos.

  • Digital Content Understanding: Answering questions and extracting information from content such as infographics, charts, figures, tables, and web pages.

  • Structured Content Generation: Generating responses in formats like HTML and JSON, based on provided prompt instructions.

  • Captioning / Description: Generating descriptions of images and videos with varying levels of detail. For example, for images, the prompt can be: “Can you write a description about the image?” For videos, the prompt can be: “Can you write a description about what's happening in this video?”

  • Extrapolations: Suggesting what else to see based on location, what might happen next/before/between images or videos, and enabling creative uses like writing stories based on visual inputs.
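
For the structured content generation use case above, one way to make JSON output easier to consume is to name the expected keys in the prompt and validate the reply before using it. The following is a minimal sketch under that assumption; it does not call any model API, and the helper names `build_json_prompt` and `parse_json_reply` are hypothetical, as is the assumption that a reply may arrive wrapped in a Markdown code fence.

```python
import json

# Three backticks, built indirectly so this example stays easy to embed.
FENCE = "`" * 3


def build_json_prompt(task, fields):
    """Build a prompt that asks for a JSON object with exactly the given keys."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        f"{task}\n"
        f"Respond with a single JSON object containing exactly these keys: {keys}.\n"
        "Do not include any text outside the JSON object."
    )


def parse_json_reply(reply, fields):
    """Parse a model reply, tolerating a Markdown code fence around the JSON."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (for example, the fence plus "json").
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence and anything after it.
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


if __name__ == "__main__":
    fields = ["caption", "objects"]
    print(build_json_prompt("Describe the attached image.", fields))
    sample = FENCE + 'json\n{"caption": "a cat", "objects": ["cat"]}\n' + FENCE
    print(parse_json_reply(sample, fields))
```

Validating the reply matters because, as noted under the limitations, the model can return inaccurate or incomplete output; a missing key is better caught here than downstream.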


Here are some of the limitations we are aware of:

  • Medical images: Gemini Pro is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.

  • Hallucinations: the model can provide factually inaccurate information.

  • Counting: Gemini Pro can only give rough approximations of object counts in images, especially for obscured objects.

  • CAPTCHAs: For safety reasons, Gemini Pro has a system to block the submission of CAPTCHAs.

  • Multi-turn (multimodal) chat: Not trained for chatbot functionality or answering questions in a chatty tone, and can perform less effectively in multi-turn conversations.

  • Following complex instructions: Can struggle with tasks requiring multiple reasoning steps. Consider breaking down instructions or providing few-shot examples for better guidance.

  • Spatial reasoning: Can struggle with precise object/text localization in images. It may be less accurate in understanding rotated images.
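
The advice above on complex instructions (break the task down, or provide few-shot examples) can be sketched as plain prompt construction. This is an illustrative sketch only: the `Input:`/`Output:` layout and the helper names `build_few_shot_prompt` and `split_into_steps` are assumptions, not a format prescribed for Gemini.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    parts = [instruction.strip(), ""]
    for source, target in examples:
        parts.append(f"Input: {source}")
        parts.append(f"Output: {target}")
        parts.append("")
    # The final, unanswered pair is what the model is asked to complete.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


def split_into_steps(steps):
    """Turn one multi-step task into a list of single-step prompts."""
    total = len(steps)
    return [
        f"Step {index} of {total}: {step.strip()}"
        for index, step in enumerate(steps, start=1)
    ]


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        "Classify the sentiment of each review as positive or negative.",
        [("Great battery life.", "positive"), ("Stopped working in a week.", "negative")],
        "The screen is sharp and bright.",
    )
    print(prompt)
    for step_prompt in split_into_steps(
        ["List the objects in the image.", "Describe how they are arranged."]
    ):
        print(step_prompt)
```

Each single-step prompt can then be sent separately, with the previous answer included in the next request where the steps depend on one another.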