Leaderboards

Multimodal reasoning

Last updated: March 12, 2025

The Labelbox multimodal reasoning leaderboard evaluates leading AI models on how closely they mimic human-like understanding and decision making. It covers four tasks: logical storytelling, detecting differences between images, generating image captions, and spatial reasoning.

Spatial

Rank  Model              Elo rating  Win rate
1     Claude 3.5 Sonnet  1220.00     75.56%
2     Gemini 2.0 Flash   1119.62     61.98%
3     O1                 1105.56     58.89%
4     Pixtral Large      1074.21     60.99%
5     Gemini 1.5 Pro     1007.70     50.88%
6     GPT-4o              990.77     52.97%
7     AWS Nova Pro        747.61     15.94%
8     Llama 3.2 90B       734.52     25.00%

Captioning

Rank  Model              Elo rating  Win rate
1     Pixtral Large      1144.53     68.05%
2     Claude 3.5 Sonnet  1114.53     68.40%
3     Gemini 1.5 Pro     1111.40     62.78%
4     AWS Nova Pro       1028.58     67.39%
5     Llama 3.2 90B       986.60     35.97%
6     Gemini 2.0 Flash    950.50     41.13%
7     GPT-4o              928.24     37.45%
8     O1                  735.62     15.45%

Differences

Rank  Model              Elo rating  Win rate
1     Pixtral Large      1172.83     73.08%
2     O1                 1172.80     76.49%
3     Gemini 2.0 Flash   1132.23     60.48%
4     Claude 3.5 Sonnet  1070.33     49.78%
5     GPT-4o             1056.78     44.00%
6     AWS Nova Pro        880.61     42.26%
7     Gemini 1.5 Pro      870.03     37.93%
8     Llama 3.2 90B       644.39     11.26%

Storytelling

Rank  Model              Elo rating  Win rate
1     O1                 1303.60     89.74%
2     Gemini 2.0 Flash   1188.20     70.61%
3     Pixtral Large      1006.68     52.24%
4     Gemini 1.5 Pro      995.75     45.49%
5     Claude 3.5 Sonnet   989.71     43.15%
6     AWS Nova Pro        903.02     42.13%
7     GPT-4o              895.53     42.86%
8     Llama 3.2 90B       717.51     12.86%
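
The tables above rank models by Elo rating, but the page does not say how those ratings are computed. As a rough illustration only, the sketch below shows the textbook Elo update for a single pairwise comparison. The K-factor of 32, the 400-point scale, and the function names are assumptions for illustration, not Labelbox's documented method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k=32 is an assumed K-factor; the leaderboard does not publish its value.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated models: the preferred one gains what the other loses.
print(update_elo(1000.0, 1000.0, score_a=1.0))
```

Under this model, a rating gap of about 100 points (such as the gap between the top two Spatial entries) corresponds to the higher-rated model being preferred in roughly 64% of head-to-head comparisons.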

Human preference evaluation

Diverse pool of US-based Alignerrs, including generalists and creative artists

Consensus of three Alignerrs per task

Standardized instructions and ontology for consistent evaluations

Carefully curated prompt generation process, balancing creativity and clarity
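
The page states that each task is judged by a consensus of three Alignerrs but does not describe how disagreements are resolved. A minimal sketch, assuming a strict majority vote over the rubric options (High, Medium, Low); the function name `consensus_label` is hypothetical:

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of raters, else None.

    Majority voting is an assumed rule; the source only says "consensus".
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Two of three raters agree, so their label is the consensus.
print(consensus_label(["High", "High", "Medium"]))
# All three disagree, so there is no consensus under this rule.
print(consensus_label(["High", "Medium", "Low"]))
```

Returning `None` on a three-way split leaves the tie-handling policy (re-rating, escalation, or discarding the task) to the caller, since the source does not specify one.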

Storytelling

Coherence
O1, Gemini 2.0 Flash, Pixtral Large, GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, AWS Nova Pro, Llama 3.2 90B

Description:

Assess how well the story logically connects all three images

Options:

High

Medium

Low

Differences

Comprehensiveness
Pixtral Large, O1, Gemini 2.0 Flash, Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet, AWS Nova Pro, Llama 3.2 90B

Description:

Evaluate how completely the response identifies meaningful differences

Options:

High

Medium

Low

Captioning

Completeness
AWS Nova Pro, Claude 3.5 Sonnet, Pixtral Large, Gemini 1.5 Pro, Gemini 2.0 Flash, Llama 3.2 90B, GPT-4o, O1

Description:

Assess how well the caption captures all key elements in the image

Options:

High

Medium

Low

Spatial

Precision
O1, Claude 3.5 Sonnet, Gemini 2.0 Flash, Pixtral Large, Gemini 1.5 Pro, Llama 3.2 90B, AWS Nova Pro, GPT-4o

Description:

Evaluate how accurately the target class's location is described

Options:

High

Medium

Low