Multimodal-reasoning

Last updated: December 9, 2024

The Labelbox multimodal reasoning leaderboard evaluates leading AI models on how closely they mimic human-like understanding and decision making. It covers four tasks: logical storytelling, detecting differences between images, image captioning, and spatial reasoning.

Spatial

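| Rank | Model | Elo rating | Win rate |
|------|-------|------------|----------|
| 1 | Claude 3.5 sonnet | 2262.17 | 74.50% |
| 2 | Gemini 1.5 pro | 2028.45 | 51.50% |
| 3 | Pixtral large | 2022.38 | 49.50% |
| 4 | Gpt4o | 2006.66 | 52.00% |
| 5 | Llama 3.2 90B | 1680.35 | 22.50% |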

Captioning

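| Rank | Model | Elo rating | Win rate |
|------|-------|------------|----------|
| 1 | Claude 3.5 sonnet | 2229.25 | 70.00% |
| 2 | Pixtral large | 2154.86 | 64.50% |
| 3 | Gemini 1.5 pro | 2088.74 | 55.00% |
| 4 | Llama 3.2 90B | 1771.15 | 29.00% |
| 5 | Gpt4o | 1756.00 | 31.50% |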

Differences

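| Rank | Model | Elo rating | Win rate |
|------|-------|------------|----------|
| 1 | Pixtral large | 2331.12 | 74.50% |
| 2 | Claude 3.5 sonnet | 2282.23 | 71.00% |
| 3 | Gpt4o | 2066.67 | 54.00% |
| 4 | Gemini 1.5 pro | 1898.22 | 42.00% |
| 5 | Llama 3.2 90B | 1421.76 | 8.50% |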

Storytelling

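| Rank | Model | Elo rating | Win rate |
|------|-------|------------|----------|
| 1 | Gpt4o | 2226.79 | 66.00% |
| 2 | Pixtral large | 2152.51 | 63.00% |
| 3 | Claude 3.5 sonnet | 2131.37 | 62.50% |
| 4 | Gemini 1.5 pro | 2110.34 | 52.50% |
| 5 | Llama 3.2 90B | 1378.99 | 6.00% |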
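
The leaderboards above report an Elo rating and a win rate for each model, but the page does not say how either figure is computed. As a rough illustration only, the sketch below applies the textbook Elo update to a sequence of pairwise preference judgments. The starting rating of 1500, the K-factor of 32, the function names, and the sample judgment are all assumptions made for illustration, not Labelbox's actual method or data.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is an assumed K-factor; the leaderboard's real value is not stated.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate_models(comparisons, initial=1500.0, k=32.0):
    """Run Elo over an iterable of (model_a, model_b, score_a) judgments."""
    ratings = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


# Hypothetical judgment: "model_a" was preferred over "model_b" once.
ratings = rate_models([("model_a", "model_b", 1.0)])
```

Elo weights each win by the strength of the opponent, while a win rate counts every win equally. That is one possible reason the two columns do not always rank models in the same order, as with Gpt4o and Pixtral large on the Spatial leaderboard.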

Human preference evaluation

Diverse pool of US-based Alignerrs, including generalists and creative artists

Consensus of three Alignerrs per task

Standardized instructions and ontology for consistent evaluations

Carefully curated prompt generation process, balancing creativity and clarity
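
The list above mentions a consensus of three Alignerrs per task without describing how that consensus is reached. One simple possibility is a strict majority vote, sketched below. The `consensus` function and the label values are hypothetical, and the real process may resolve disagreements differently.

```python
from collections import Counter


def consensus(votes):
    """Return the label chosen by a strict majority of raters, else None.

    votes is a non-empty list of labels, one per rater (three per task
    in the setup described above). Majority voting is an assumption here.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    # A strict majority means more than half of the raters agree.
    return label if count > len(votes) / 2 else None


# Hypothetical task: two of three raters prefer response "a".
winner = consensus(["a", "a", "b"])
```

With three raters and two possible labels a strict majority always exists. With three or more labels (for example, if ties are allowed) the function can return `None`, which would need a separate tie-breaking rule.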

Storytelling

Criteria: Detail, Coherence, Creativity

Models compared: Claude 3.5 sonnet, Pixtral large, Gpt4o, Gemini 1.5 pro, Llama 3.2 90B

Description: Assess how effectively specific visual elements are incorporated

Options: 5, 4, 3, 2, 1

Differences

Criteria: Precision, Organization, Comprehensiveness

Models compared: Pixtral large, Claude 3.5 sonnet, Gemini 1.5 pro, Gpt4o, Llama 3.2 90B

Description: Evaluate accuracy and clarity of spatial difference descriptions

Options: 5, 4, 3, 2, 1

Captioning

Criteria: Language, Creativity, Completeness

Models compared: Claude 3.5 sonnet, Pixtral large, Gemini 1.5 pro, Llama 3.2 90B, Gpt4o

Description: Assess grammar, natural flow, and word choice

Options: 5, 4, 3, 2, 1

Spatial

Criteria: Clarity, Distances, Precision

Models compared: Claude 3.5 sonnet, Gpt4o, Gemini 1.5 pro, Pixtral large, Llama 3.2 90B

Description: Assess how well spatial relationships with other objects are described

Options: 5, 4, 3, 2, 1