GPT-4 with Vision, sometimes referred to as GPT-4V or gpt-4-vision-preview in the API, allows the model to take in images and answer questions about them.
Historically, language model systems have been limited to a single input modality: text. For many use cases, this constrained the areas where models like GPT-4 could be applied.
This marks GPT-4's move to being a multimodal model, meaning that it can now accept multiple input "modalities", namely text and images.
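For example, with the official `openai` Python SDK (v1), a single chat request can combine a text prompt with an image. The following is a minimal sketch; the prompt and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single user message can mix text parts and image parts.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    # placeholder: any publicly reachable image URL
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,  # the vision preview model defaults to a low token limit
)
print(response.choices[0].message.content)
```

Images can be supplied either by URL, as here, or as base64-encoded data (see the sketch after the limitations list below).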
GPT-4V is an upgrade over its predecessor, GPT-3.5, introducing the ability to accept prompts made up of text, images, or both. OpenAI published results comparing the GPT-4 model against a suite of standard academic vision benchmarks:
| Benchmark | GPT-4 (evaluated few-shot) | Few-shot SOTA | SOTA (best external model, incl. benchmark-specific training) |
|---|---|---|---|
| VQAv2, VQA score (test-dev) | 77.2% (0-shot) | 67.6% (Flamingo, 32-shot) | 84.3% (PaLI-17B) |
| TextVQA, VQA score (val) | 78.0% (0-shot) | 37.9% (Flamingo, 32-shot) | 71.8% (PaLI-17B) |
| ChartQA, relaxed accuracy (test) | 78.5% | - | 58.6% (Pix2Struct Large) |
| AI2 Diagram (AI2D), accuracy (test) | 78.2% (0-shot) | - | 42.1% (Pix2Struct Large) |
| DocVQA, ANLS score (test) | 88.4% (0-shot) | - | 88.4% (ERNIE-Layout 2.0) |
| Infographic VQA, ANLS score (test) | 75.1% (0-shot) | - | 61.2% (Applica.ai TILT) |
| TVQA, accuracy (val) | 87.3% (0-shot) | - | 86.5% (MERLOT Reserve Large) |
| LSMDC, fill-in-the-blank accuracy (test) | 45.7% (0-shot) | 31.0% (MERLOT Reserve Large) | 52.9% (MERLOT) |
While GPT-4V is powerful and can be used in many situations, it is important to understand the limitations of the model. Here are some of the limitations we are aware of:
- Medical images: GPT-4V is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
- Non-English: GPT-4V may not perform optimally when handling images containing text in non-Latin alphabets, such as Japanese or Korean.
- Small text: Users should enlarge small text within the image to improve readability for GPT-4V, but avoid cropping important details.
- Rotation: GPT-4V may misinterpret rotated or upside-down text and images.
- Visual elements: GPT-4V may struggle to understand graphs or text where colors or styles (e.g., solid, dashed, or dotted lines) vary.
- Spatial reasoning: GPT-4V struggles with tasks requiring precise spatial localization, such as identifying chess positions.
- Accuracy: GPT-4V may generate incorrect descriptions or captions in certain scenarios.
- Image shape: GPT-4V struggles with panoramic and fisheye images.
- Metadata and resizing: GPT-4V doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions (see the sketch after this list for how image data is sent).
- Counting: GPT-4V may give approximate counts for objects in images.
- CAPTCHAs: For safety reasons, GPT-4V has a system in place to block the submission of CAPTCHAs.
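As noted in the metadata and resizing point above, only the image pixels are sent to the model. For local files, this is done by embedding the image as a base64 data URL. The sketch below again assumes the v1 `openai` Python SDK, with `chart.png` as a placeholder path; the optional `detail` parameter controls the resolution tier used for processing:

```python
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(path: str) -> str:
    """Return the base64 encoding of a local image file."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# placeholder path; only the pixel data is transmitted, so the
# filename and EXIF metadata never reach the model
b64_image = encode_image("chart.png")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64_image}",
                        # "high" requests higher-resolution tiling, which
                        # can help with small text in the image
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```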
https://platform.openai.com/docs/guides/vision
https://openai.com/contributions/gpt-4v
Base data row types supported: image, html, conversational, pdf
Supported attachment types: image, html, text, pdf