Multimodal reasoning
Last updated: December 9, 2024
The Labelbox multimodal reasoning leaderboard evaluates AI models on their ability to mimic human-like understanding and decision making. It ranks leading models on four tasks: logical storytelling, detecting differences between images, generating image captions, and spatial reasoning.
Spatial
Rank ↓ | Model | Elo rating | Win Rate
---|---|---|---
1 | Claude 3.5 sonnet | 2262.17 | 74.50%
2 | Gemini 1.5 pro | 2028.45 | 51.50%
3 | Pixtral large | 2022.38 | 49.50%
4 | Gpt4o | 2006.66 | 52.00%
5 | Llama 3.2 90B | 1680.35 | 22.50%
Captioning
Rank ↓ | Model | Elo rating | Win Rate
---|---|---|---
1 | Claude 3.5 sonnet | 2229.25 | 70.00%
2 | Pixtral large | 2154.86 | 64.50%
3 | Gemini 1.5 pro | 2088.74 | 55.00%
4 | Llama 3.2 90B | 1771.15 | 29.00%
5 | Gpt4o | 1756.00 | 31.50%
Differences
Rank ↓ | Model | Elo rating | Win Rate
---|---|---|---
1 | Pixtral large | 2331.12 | 74.50%
2 | Claude 3.5 sonnet | 2282.23 | 71.00%
3 | Gpt4o | 2066.67 | 54.00%
4 | Gemini 1.5 pro | 1898.22 | 42.00%
5 | Llama 3.2 90B | 1421.76 | 8.50%
Storytelling
Rank ↓ | Model | Elo rating | Win Rate
---|---|---|---
1 | Gpt4o | 2226.79 | 66.00%
2 | Pixtral large | 2152.51 | 63.00%
3 | Claude 3.5 sonnet | 2131.37 | 62.50%
4 | Gemini 1.5 pro | 2110.34 | 52.50%
5 | Llama 3.2 90B | 1378.99 | 6.00%
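The page does not say how the Elo ratings are computed. As a rough reading aid, the sketch below assumes the conventional Elo formula (logistic curve, base 10, 400-point scale) to turn a rating gap into an expected head-to-head win probability. The function name `expected_score` and the `scale` parameter are illustrative and not part of any Labelbox API.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo model.

    Assumption: base-10 logistic curve with a 400-point scale. The
    leaderboard's actual rating procedure is not documented in the source.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Spatial ratings taken from the table above.
claude_spatial = 2262.17
gemini_spatial = 2028.45

p_claude = expected_score(claude_spatial, gemini_spatial)
p_gemini = expected_score(gemini_spatial, claude_spatial)

print(f"Claude 3.5 sonnet vs Gemini 1.5 pro: {p_claude:.2f}")
print(f"Gemini 1.5 pro vs Claude 3.5 sonnet: {p_gemini:.2f}")
```

Under those assumptions, the Spatial rating gap between Claude 3.5 sonnet and Gemini 1.5 pro corresponds to an expected head-to-head win probability of roughly 0.79 for the higher-rated model. The reported win rates are aggregated over all opponents, so they need not match any single pairwise estimate.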
Human preference evaluation
Diverse pool of US-based Alignerrs, including generalists and creative artists
Consensus of three Alignerrs per task
Standardized instructions and ontology for consistent evaluations
Carefully curated prompt generation process, balancing creativity and clarity
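The source does not describe how the three Alignerr judgments are merged into a consensus. One plausible approach, shown here purely as an assumption, is to take the median of the three 1 to 5 ratings, which equals the majority value whenever two raters agree. The function name `consensus_rating` is hypothetical.

```python
from statistics import median


def consensus_rating(ratings: list[int]) -> int:
    """Merge three 1-5 ratings into one score by taking the median.

    Assumption: the median is used as the consensus rule. The source only
    states that three Alignerrs rate each task, not how votes are combined.
    """
    if len(ratings) != 3:
        raise ValueError("expected exactly three ratings per task")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    # With an odd number of integer ratings, the median is the middle value.
    return int(median(ratings))


# Two raters agree on 3, so the consensus is 3 regardless of the outlier.
agreeing = consensus_rating([3, 3, 1])
# No two raters agree; the middle value is used.
split = consensus_rating([5, 4, 2])
print(agreeing, split)
```

The median is robust to a single outlying rater, which is one reason it is a common choice when each task has an odd number of judgments.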
Storytelling
Description: Assess how effectively specific visual elements are incorporated
Options: 5, 4, 3, 2, 1
Differences
Description: Evaluate accuracy and clarity of spatial difference descriptions
Options: 5, 4, 3, 2, 1
Captioning
Description: Assess grammar, natural flow, and word choice
Options: 5, 4, 3, 2, 1
Spatial
Description: Assess how well spatial relationships with other objects are described
Options: 5, 4, 3, 2, 1