
GPT-4V(ision)


GPT-4 with Vision, sometimes referred to as GPT-4V or gpt-4-vision-preview in the API, allows the model to take in images and answer questions about them.
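
For example, a minimal request following the vision guide cited below might look like this in Python (the model answers a question about an image supplied by URL; the URL and prompt here are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single user message can mix "text" and "image_url" content parts.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},  # placeholder
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```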

Intended Use

Historically, language model systems have been limited to a single input modality: text. For many use cases, this constrained the areas where models like GPT-4 could be applied.

This release marks GPT-4's move to being a multimodal model, meaning it can now accept multiple input "modalities", namely text and images.


Performance

GPT-4V is an upgrade over its text-only predecessor, GPT-3.5, introducing the ability to accept prompts consisting of text, images, or both. OpenAI published results comparing the GPT-4 model against prior state-of-the-art models on a suite of standard academic vision benchmarks.


| Benchmark | Metric | GPT-4 (evaluated few-shot) | Few-shot SOTA | SOTA (best external model, includes benchmark-specific training) |
| --- | --- | --- | --- | --- |
| VQAv2 | VQA score (test-dev) | 77.2% (0-shot) | 67.6% (Flamingo, 32-shot) | 84.3% (PaLI-17B) |
| TextVQA | VQA score (val) | 78.0% (0-shot) | 37.9% (Flamingo, 32-shot) | 71.8% (PaLI-17B) |
| ChartQA | Relaxed accuracy (test) | 78.5% | - | 58.6% (Pix2Struct Large) |
| AI2 Diagram (AI2D) | Accuracy (test) | 78.2% (0-shot) | - | 42.1% (Pix2Struct Large) |
| DocVQA | ANLS score (test) | 88.4% (0-shot) | - | 88.4% (ERNIE-Layout 2.0) |
| Infographic VQA | ANLS score (test) | 75.1% (0-shot) | - | 61.2% (Applica.ai TILT) |
| TVQA | Accuracy (val) | 87.3% (0-shot) | - | 86.5% (MERLOT Reserve Large) |
| LSMDC | Fill-in-the-blank accuracy (test) | 45.7% (0-shot) | 31.0% (MERLOT Reserve Large) | 52.9% (MERLOT) |


Limitations

While GPT-4V is powerful and can be used in many situations, it is important to understand the limitations of the model. Here are some of the limitations we are aware of:

  • Medical images: GPT-4V is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.

  • Non-English: GPT-4V may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.

  • Small text: Users should enlarge text within the image to improve readability for GPT-4V, but avoid cropping important details (see the preprocessing sketch after this list).

  • Rotation: GPT-4V may misinterpret rotated / upside-down text or images.

  • Visual elements: GPT-4V may struggle to understand graphs or text where colors or styles like solid, dashed, or dotted lines vary.

  • Spatial reasoning: GPT-4V struggles with tasks requiring precise spatial localization, such as identifying chess positions.

  • Accuracy: GPT-4V may generate incorrect descriptions or captions in certain scenarios.

  • Image shape: GPT-4V struggles with panoramic and fisheye images.

  • Metadata and resizing: GPT-4V doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions.

  • Counting: GPT-4V may give approximate counts for objects in images.

  • CAPTCHAs: For safety reasons, GPT-4V has a system that blocks the submission of CAPTCHAs.
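
To work around the small-text limitation, one option is to upscale images before submitting them. Below is a minimal Pillow sketch; the upscale_for_readability helper and the 1024-pixel threshold are illustrative assumptions, not values from the source.

```python
from PIL import Image

def upscale_for_readability(path: str, min_side: int = 1024) -> Image.Image:
    """Upscale an image so its shorter side is at least min_side pixels,
    preserving aspect ratio. The threshold is an illustrative assumption."""
    img = Image.open(path)
    shorter = min(img.size)
    if shorter >= min_side:
        return img  # already large enough; skip resampling
    scale = min_side / shorter
    return img.resize(
        (round(img.width * scale), round(img.height * scale)),
        Image.LANCZOS,  # high-quality resampling filter
    )
```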


Citation

https://platform.openai.com/docs/guides/vision

https://openai.com/contributions/gpt-4v


Data types supported for GPT-4V in Labelbox

Supported base data row types: image, html, conversational, pdf.

Supported attachment types: image, html, text, pdf.
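
As a rough illustration, attaching supplementary context to a data row might look like the sketch below with the Labelbox Python SDK. The dataset name, URLs, and attachment type string are hypothetical placeholders; check the Labelbox SDK documentation for the exact signatures and accepted type values.

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_LABELBOX_API_KEY")  # placeholder key
dataset = client.create_dataset(name="gpt4v-demo")   # hypothetical dataset

# Create an image data row, then attach a reference image to it.
data_row = dataset.create_data_row(
    row_data="https://example.com/photo.jpg",  # placeholder asset URL
)
data_row.create_attachment(
    attachment_type="IMAGE",                        # assumed type string
    attachment_value="https://example.com/ref.jpg"  # placeholder
)
```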