Google Gemini 1.5 Pro
Google Gemini 1.5 Pro is a large-scale language model trained jointly across image, audio, video, and text data. It is designed to combine strong generalist capabilities across modalities with cutting-edge understanding and reasoning performance. This model uses Gemini Pro on Vertex AI, with enhanced performance, scalability, and deployability.
Intended Use
Gemini 1.5, the next-generation model from Google, offers significant performance enhancements. Built on a Mixture-of-Experts (MoE) architecture, it is more efficient to train and serve. Gemini 1.5 Pro, the mid-size multimodal model, delivers quality comparable to the larger 1.0 Ultra and introduces an experimental long-context understanding feature. With a context window of up to 1 million tokens, it can process vast amounts of information in a single prompt, including:
1 hour of video
11 hours of audio
over 700,000 words
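The figures above imply roughly 0.7 words per token (700,000 words in a 1 million token window). The following Python sketch uses that ratio to check whether a text is likely to fit. The ratio is only a rule of thumb derived from the numbers above, not a real tokenizer, and the function names and the `reserved_for_output` default are hypothetical choices made for illustration.

```python
# Rough context-window budgeting, based only on the figures quoted above.
CONTEXT_WINDOW_TOKENS = 1_000_000  # "up to 1 million tokens"
WORDS_AT_FULL_WINDOW = 700_000     # "over 700,000 words"


def estimate_tokens(text: str) -> int:
    """Estimate a token count from a whitespace word count.

    Uses the implied ratio of about 0.7 words per token, with integer
    ceiling division so the estimate never rounds down.
    """
    words = len(text.split())
    return -(-words * CONTEXT_WINDOW_TOKENS // WORDS_AT_FULL_WINDOW)


def fits_in_context(text: str, reserved_for_output: int = 8_192) -> bool:
    """Return True if the estimated prompt size leaves room for a reply.

    reserved_for_output is an assumed safety margin, not a documented limit.
    """
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS
```

For real workloads, an exact token count from the serving platform is preferable to this estimate, since token-to-word ratios vary by language and content type.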
Use cases
Gemini is good at a wide variety of multimodal use cases, including but not limited to:
Info Seeking: Fusing world knowledge with information extracted from the images and videos.
Object Recognition: Answering questions related to fine-grained identification of the objects in images and videos.
Digital Content Understanding: Answering questions and extracting information from various contents like infographics, charts, figures, tables, and web pages.
Structured Content Generation: Generating responses in formats like HTML and JSON, based on provided prompt instructions.
Captioning / Description: Generating descriptions of images and videos with varying levels of detail. For example, for images, the prompt can be: “Can you write a description about the image?”. For videos, the prompt can be: “Can you write a description about what's happening in this video?”.
Extrapolations: Suggesting what else to see based on location, what might happen next/before/between images or videos, and enabling creative uses like writing stories based on visual inputs.
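As an illustration of the structured content generation use case, the sketch below builds a prompt that requests a JSON object and then parses the reply defensively. It does not call any model. The helper names are hypothetical, and the assumption that a reply may arrive wrapped in a Markdown code fence is a common pattern rather than documented behavior of this model.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker, built indirectly for readability


def build_json_prompt(task: str, fields: list[str]) -> str:
    """Compose a prompt that asks for a single JSON object with fixed keys."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        f"{task}\n"
        f"Respond with a single JSON object containing exactly these keys: {keys}.\n"
        "Do not include any text outside the JSON object."
    )


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply as a JSON object, tolerating an optional code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]         # drop the closing fence line
        text = "\n".join(lines)
    data = json.loads(text)            # raises json.JSONDecodeError on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data
```

Validating the parsed keys against the requested `fields` is a sensible next step, because the model can return factually inaccurate or incomplete output.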
Limitations
Here are some of the limitations we are aware of:
Medical images: Gemini Pro is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
Hallucinations: The model can provide factually inaccurate information.
Counting: Gemini Pro can only provide rough approximations of object counts in images, especially for obscured objects.
CAPTCHAs: For safety reasons, Gemini Pro has a system to block the submission of CAPTCHAs.
Multi-turn (multimodal) chat: Not trained for chatbot functionality or answering questions in a chatty tone, and can perform less effectively in multi-turn conversations.
Following complex instructions: Can struggle with tasks requiring multiple reasoning steps. Consider breaking down instructions or providing few-shot examples for better guidance.
Spatial reasoning: Can struggle with precise object/text localization in images. It may be less accurate in understanding rotated images.
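The two mitigations suggested for complex instructions, few-shot examples and breaking a task into steps, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the prompt layout is one of many possible formats, the function names are hypothetical, and `ask` stands in for whatever function sends a prompt to the model and returns its text reply.

```python
from typing import Callable, Sequence


def build_few_shot_prompt(
    instruction: str,
    examples: Sequence[tuple[str, str]],
    query: str,
) -> str:
    """Assemble an instruction, worked input/output examples, and a new input."""
    parts = [instruction.strip(), ""]
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


def run_in_steps(steps: Sequence[str], ask: Callable[[str], str]) -> list[str]:
    """Send one prompt per step, passing each reply into the next prompt."""
    answers: list[str] = []
    previous = ""
    for step in steps:
        if previous:
            prompt = f"Previous result:\n{previous}\n\nNext step: {step}"
        else:
            prompt = step
        previous = ask(prompt)  # ask is a placeholder for a real model call
        answers.append(previous)
    return answers
```

Keeping each step's prompt short and self-contained also reduces reliance on multi-turn conversation, which is listed above as a weaker area.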
Citation
https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#sundar-note