GPT-4 with Vision, sometimes referred to as GPT-4V or gpt-4-vision-preview in the API, allows the model to take in images and answer questions about them.
Historically, language model systems have been limited to a single input modality: text. For many use cases, this constrained the areas where models like GPT-4 could be applied.
This marks GPT-4's move to being a multimodal model, meaning that it can now accept multiple input "modalities", namely text and images.
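For example, with the official `openai` Python SDK (v1), a single chat request can combine a text prompt with an image. The following is a minimal sketch; the prompt and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single user message can mix text parts and image parts.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    # placeholder: any publicly reachable image URL
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,  # the vision preview model defaults to a low token limit
)
print(response.choices[0].message.content)
```

Images can be supplied either by URL, as here, or as base64-encoded data (see the sketch after the limitations list below).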
GPT-4V is an upgrade over its predecessor, GPT-3.5, introducing the ability to accept prompts made up of text, images, or both. OpenAI published results comparing the GPT-4 model against a suite of standard academic vision benchmarks:
| Benchmark | GPT-4 (evaluated few-shot) | Few-shot SOTA | SOTA (best external model, incl. benchmark-specific training) |
|---|---|---|---|
| VQAv2, VQA score (test-dev) | 77.2% (0-shot) | 67.6% (Flamingo, 32-shot) | 84.3% (PaLI-17B) |
| TextVQA, VQA score (val) | 78.0% (0-shot) | 37.9% (Flamingo, 32-shot) | 71.8% (PaLI-17B) |
| ChartQA, relaxed accuracy (test) | 78.5% | - | 58.6% (Pix2Struct Large) |
| AI2 Diagram (AI2D), accuracy (test) | 78.2% (0-shot) | - | 42.1% (Pix2Struct Large) |
| DocVQA, ANLS score (test) | 88.4% (0-shot) | - | 88.4% (ERNIE-Layout 2.0) |
| Infographic VQA, ANLS score (test) | 75.1% (0-shot) | - | 61.2% (Applica.ai TILT) |
| TVQA, accuracy (val) | 87.3% (0-shot) | - | 86.5% (MERLOT Reserve Large) |
| LSMDC, fill-in-the-blank accuracy (test) | 45.7% (0-shot) | 31.0% (MERLOT Reserve Large) | 52.9% (MERLOT) |
While GPT-4V is powerful and can be used in many situations, it is important to understand the limitations of the model. Here are some of the limitations we are aware of:
- Medical images: GPT-4V is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
- Non-English: GPT-4V may not perform optimally when handling images containing text in non-Latin alphabets, such as Japanese or Korean.
- Small text: Users should enlarge small text within the image to improve readability for GPT-4V, but avoid cropping important details.
- Rotation: GPT-4V may misinterpret rotated or upside-down text and images.
- Visual elements: GPT-4V may struggle to understand graphs or text where colors or styles (e.g., solid, dashed, or dotted lines) vary.
- Spatial reasoning: GPT-4V struggles with tasks requiring precise spatial localization, such as identifying chess positions.
- Accuracy: GPT-4V may generate incorrect descriptions or captions in certain scenarios.
- Image shape: GPT-4V struggles with panoramic and fisheye images.
- Metadata and resizing: GPT-4V doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions (see the sketch after this list for how image data is sent).
- Counting: GPT-4V may give approximate counts for objects in images.
- CAPTCHAs: For safety reasons, GPT-4V has a system in place to block the submission of CAPTCHAs.
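As noted in the metadata and resizing point above, only the image pixels are sent to the model. For local files, this is done by embedding the image as a base64 data URL. The sketch below again assumes the v1 `openai` Python SDK, with `chart.png` as a placeholder path; the optional `detail` parameter controls the resolution tier used for processing:

```python
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(path: str) -> str:
    """Return the base64 encoding of a local image file."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# placeholder path; only the pixel data is transmitted, so the
# filename and EXIF metadata never reach the model
b64_image = encode_image("chart.png")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64_image}",
                        # "high" requests higher-resolution tiling, which
                        # can help with small text in the image
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```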
https://platform.openai.com/docs/guides/vision
https://openai.com/contributions/gpt-4v
Base data row types supported: image, html, conversational, pdf
Supported attachment types: image, html, text, pdf