Google Gemini Pro Vision

Text generation

Google Gemini Pro Vision was created from the ground up to be multimodal (text, images, videos) and to scale across a wide range of tasks.

Intended Use

Gemini Pro Vision is a large multimodal model in the Gemini family that accepts text and visual input (image and video) and generates relevant text responses.

Gemini Pro Vision is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from images and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots.

Use cases

  1. Visual information seeking: Use external knowledge combined with information extracted from the input image or video to answer questions.

  2. Object recognition: Answer questions related to fine-grained identification of the objects in images and videos.

  3. Digital content understanding: Answer questions and extract information from visual content like infographics, charts, figures, tables, and web pages.

  4. Structured content generation: Generate responses based on multimodal inputs in formats like HTML and JSON.

  5. Captioning and description: Generate descriptions of images and videos with varying levels of detail.

  6. Reasoning: Compositionally infer new information without memorization or retrieval.
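
For the structured content generation use case, a caller usually has to parse and check the text the model returns. The sketch below is one possible way to do that and is not part of any Gemini SDK: the helper names (extract_json, missing_keys), the sample reply, and its field names are all hypothetical, and the model call itself is left out.

```python
import json
import re
from typing import Any

# Built programmatically so this example never contains a literal code fence.
FENCE = "`" * 3


def extract_json(reply: str) -> Any:
    """Parse JSON from a model reply, unwrapping a Markdown code fence if present."""
    text = reply.strip()
    # Assumption: a reply may arrive wrapped in a fenced block, optionally tagged "json".
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)


def missing_keys(record: dict, required: set) -> list:
    """Return the required keys that are absent from the parsed record, sorted."""
    return sorted(required - set(record))


# Hypothetical reply to a prompt such as "Describe this chart as JSON".
sample_reply = (
    FENCE + "json\n"
    '{"title": "Quarterly revenue", "chart_type": "bar", "series_count": 3}\n'
    + FENCE
)

record = extract_json(sample_reply)
print(record["chart_type"])
print(missing_keys(record, {"title", "chart_type", "caption"}))
```

Checking for required keys before using the record lets incomplete output be caught early, for example by retrying the request with a more explicit prompt.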

