Multimodal reasoning
Last updated: March 12, 2025

The Labelbox multimodal reasoning leaderboard evaluates AI models on their ability to mimic human-like understanding and decision-making. It ranks leading models on four tasks: logical storytelling, detecting differences between images, generating image captions, and spatial reasoning.
Spatial
| Rank | Model | Elo rating | Win rate |
| --- | --- | --- | --- |
| 1 | Claude 3.5 Sonnet | 1220.00 | 75.56% |
| 2 | Gemini 2.0 Flash | 1119.62 | 61.98% |
| 3 | O1 | 1105.56 | 58.89% |
| 4 | Pixtral Large | 1074.21 | 60.99% |
| 5 | Gemini 1.5 Pro | 1007.70 | 50.88% |
| 6 | GPT-4o | 990.77 | 52.97% |
| 7 | AWS Nova Pro | 747.61 | 15.94% |
| 8 | Llama 3.2 90B | 734.52 | 25.00% |
Captioning
| Rank | Model | Elo rating | Win rate |
| --- | --- | --- | --- |
| 1 | Pixtral Large | 1144.53 | 68.05% |
| 2 | Claude 3.5 Sonnet | 1114.53 | 68.40% |
| 3 | Gemini 1.5 Pro | 1111.40 | 62.78% |
| 4 | AWS Nova Pro | 1028.58 | 67.39% |
| 5 | Llama 3.2 90B | 986.60 | 35.97% |
| 6 | Gemini 2.0 Flash | 950.50 | 41.13% |
| 7 | GPT-4o | 928.24 | 37.45% |
| 8 | O1 | 735.62 | 15.45% |
Differences
| Rank | Model | Elo rating | Win rate |
| --- | --- | --- | --- |
| 1 | Pixtral Large | 1172.83 | 73.08% |
| 2 | O1 | 1172.80 | 76.49% |
| 3 | Gemini 2.0 Flash | 1132.23 | 60.48% |
| 4 | Claude 3.5 Sonnet | 1070.33 | 49.78% |
| 5 | GPT-4o | 1056.78 | 44.00% |
| 6 | AWS Nova Pro | 880.61 | 42.26% |
| 7 | Gemini 1.5 Pro | 870.03 | 37.93% |
| 8 | Llama 3.2 90B | 644.39 | 11.26% |
Storytelling
| Rank | Model | Elo rating | Win rate |
| --- | --- | --- | --- |
| 1 | O1 | 1303.60 | 89.74% |
| 2 | Gemini 2.0 Flash | 1188.20 | 70.61% |
| 3 | Pixtral Large | 1006.68 | 52.24% |
| 4 | Gemini 1.5 Pro | 995.75 | 45.49% |
| 5 | Claude 3.5 Sonnet | 989.71 | 43.15% |
| 6 | AWS Nova Pro | 903.02 | 42.13% |
| 7 | GPT-4o | 895.53 | 42.86% |
| 8 | Llama 3.2 90B | 717.51 | 12.86% |
Human preference evaluation
- Diverse pool of US-based Alignerrs, including generalists and creative artists
- Consensus of three Alignerrs per task
- Standardized instructions and ontology for consistent evaluations
- Carefully curated prompt generation process, balancing creativity and clarity
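The Elo ratings in the tables above are the kind of scores that fall out of repeated pairwise preference comparisons. As a minimal sketch (the leaderboard does not publish its exact parameters, so the K-factor of 32 and the standard 400-point logistic scale below are assumptions), a conventional Elo update from one head-to-head outcome looks like this:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    k=32 is a common illustrative choice, not a Labelbox-documented value.
    """
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # The winner gains what the loser sheds, scaled by how surprising the result was.
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b
```

For example, two evenly rated models (1000 vs. 1000) each have a 50% expected score, so a single win moves the winner up by 16 points and the loser down by 16 under this scheme.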
Storytelling
- Description: Assess how well the story logically connects all three images
- Options: High, Medium, Low

Differences
- Description: Evaluate how completely the response identifies meaningful differences
- Options: High, Medium, Low

Captioning
- Description: Assess how well the caption captures all key elements in the image
- Options: High, Medium, Low

Spatial
- Description: Evaluate how accurately the target class's location is described
- Options: High, Medium, Low
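Each task is rated High, Medium, or Low by three Alignerrs. The source does not state the exact aggregation rule, so the majority-vote sketch below, with a fall-back to Medium on a three-way split, is purely a hypothetical illustration of how three ratings could be reduced to one consensus label:

```python
from collections import Counter

# Hypothetical aggregation of three Alignerr ratings into one consensus label.
# Majority vote and the "Medium" tiebreak are assumptions, not Labelbox's
# documented procedure.

def consensus(labels: list[str]) -> str:
    """Return the majority label of three ratings; on a three-way split,
    fall back to the middle rating."""
    label, count = Counter(labels).most_common(1)[0]
    if count >= 2:
        return label
    return "Medium"
```

With this rule, `["High", "High", "Low"]` resolves to High, while `["High", "Medium", "Low"]` falls back to Medium.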