
Labelbox Leaderboards: Redefining AI evaluations with human-centric assessments

In the rapidly evolving landscape of artificial intelligence, traditional benchmarks are no longer sufficient to capture the true capabilities of AI models. At Labelbox, we're excited to introduce our groundbreaking Labelbox leaderboards—an innovative, scientific process to rank multimodal AI models that goes beyond conventional benchmarks.

The limitations of current benchmarks and leaderboards

Benchmark contamination

One of the most pressing issues in AI evaluation today is benchmark contamination. As large language models are trained on vast amounts of internet data, they often inadvertently include the very datasets used to evaluate them. This leads to inflated performance metrics that don't accurately reflect real-world capabilities. For example:

  • The LAMBADA dataset, designed to test language understanding, has been found in the training data of several popular language models, with an LM Contamination Index of 29.3%.
  • Portions of the SQuAD question-answering dataset have been discovered in the pretraining corpora of multiple large language models.
  • Even coding benchmarks like HumanEval have seen their solutions leaked online, potentially contaminating future model training.

This contamination makes it increasingly difficult to trust traditional benchmark results, as models may be “cheating” by memorizing test data rather than demonstrating true understanding or capability.

Existing leaderboards: A step forward, but not enough

While several leaderboards have emerged to address the limitations of traditional benchmarks, they each come with their own set of challenges.

LMSYS chatbot arena

LMSYS Chatbot Arena, despite its broad accessibility, faces notable challenges in providing objective AI evaluations. Its reliance on non-expert assessments and emphasis on chat-based evaluations may introduce personal biases, potentially favoring engaging responses over true intelligence. Researchers worry that this approach could lead companies to prioritize optimizing for superficial metrics rather than genuine real-world performance. Furthermore, LMSYS's commercial ties raise concerns about impartiality and the potential for an uneven evaluation playing field, as usage data may be selectively shared with certain partners.

Scale AI's SEAL

Scale’s Safety, Evaluations, and Alignment Lab (SEAL), released a few months ago, offers detailed evaluations for topics such as reasoning, coding, and agentic tool use. However, its infrequent updates and primary focus on language models mean it may not capture the full spectrum of rapidly advancing multimodal AI capabilities.

Challenges in AI evaluation

These and other existing leaderboards all run into core challenges with AI evaluations:

1) Data contamination and overfitting to public benchmarks

2) Scalability issues as models improve and more are added

3) Lack of standards for evaluation instructions and criteria

4) Difficulty in linking evaluation results to real-world outcomes

5) Potential bias in human evaluations

Introducing the Labelbox Leaderboards: A comprehensive approach to AI evaluation 

The Labelbox Leaderboards are the first to tackle these challenges by conducting structured evaluations on subjective AI model outputs using human experts and a scientific process that provides detailed feature-level metrics and multiple ratings. Leaderboards are available for Complex Reasoning, Multimodal Reasoning, Image Generation, Speech Generation, and Video Generation.

Our goal is to go beyond traditional leaderboards and benchmarks by incorporating the following elements:

1. Multimodal and niche focus

Unlike leaderboards that primarily focus on text-based large language models, we evaluate a diverse range of AI modalities and specialized applications, including:

  • Image generation and analysis
  • Audio processing and synthesis
  • Video creation and manipulation
  • Complex and multimodal reasoning

2. Expert human evaluation

For every evaluation, public or private, it’s critical for the raters to reflect your target audience. We place expert human judgment, using our Alignerr workforce, at the core of the evaluation process to ensure:

  • Subjective quality assessment: Humans assess aspects like aesthetic appeal, realism, and expressiveness.
  • Contextual understanding: Evaluators consider the broader context and intended use.
  • Alignment with human preferences: Raters ensure evaluations reflect criteria that matter to end-users.
  • Resistance to contamination: Human evaluations on novel tasks are less prone to data contamination.

3. Reliable and transparent methodology

We are committed to performing trustworthy evaluations using a variety of sophisticated metrics. Labelbox balances privacy with openness by providing detailed feature-level metrics (e.g. prompt alignment, visual appeal, and numerical count for text-image models) and multiple ratings.

In addition to expert human raters performing the evaluations, our methodology uses the Labelbox Platform to generate advanced metrics on both rater and model performance. We provide the following metrics across our leaderboards:

  • Elo rating system: Adapted from competitive chess, our Elo system provides a dynamic rating that adjusts based on head-to-head comparisons between models. This allows us to capture relative performance in a way that's responsive to improvements over time (see the sketch after this list).
  • TrueSkill rating: Originally developed for Xbox Live, TrueSkill offers a more nuanced rating that accounts for both a model's performance and the uncertainty in that performance. This is particularly useful for newer models or those with fewer evaluations.
  • Rank percentages: We track how often each model achieves each rank (1st through 5th) in direct comparisons. This provides insight into not just average performance, but consistency of top-tier results.
  • Average rating: A straightforward metric that gives an overall sense of model performance across all evaluations.
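
To make the rating mechanics concrete, here is a minimal sketch of how an Elo-style rating can be updated from pairwise human preference judgments. The starting rating of 1000, the K-factor of 32, the model names, and the sample judgments are illustrative assumptions rather than our production parameters; TrueSkill updates are typically handled by a dedicated implementation (for example, the open-source trueskill Python package) rather than by hand.

```python
from collections import defaultdict

K = 32               # update step size (illustrative assumption)
BASE_RATING = 1000   # starting rating for every model (illustrative assumption)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str) -> None:
    """Adjust both ratings after one head-to-head human preference judgment."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = K * (1.0 - exp_win)   # an "upset" win moves the ratings more
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical pairwise judgments: (preferred model, other model)
judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

ratings = defaultdict(lambda: float(BASE_RATING))
for winner, loser in judgments:
    update_elo(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.2f}")
```

Ranked lists of more than two models can be folded into this scheme by treating each ranking as a set of implied pairwise outcomes; TrueSkill and the rank-based metrics capture complementary aspects such as uncertainty and consistency.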

In addition to these key metrics, our methodology incorporates the following characteristics to ensure a balanced and fair evaluation: 

  • Expert evaluators: Utilizing skilled professionals from our Alignerr platform to provide nuanced, context-aware assessments.
  • Comprehensive and novel datasets: Curated to reflect real-world scenarios while minimizing contamination.
  • Transparent reporting: Detailed insights into our methodologies and results without compromising proprietary information.

4. Continuously updated evaluations

Our leaderboard isn't static; we plan to regularly update our evaluations to include the latest models and evaluation metrics, ensuring stakeholders have access to current and relevant information.

Leaderboard insights: A glimpse into the image generation leaderboard

To illustrate the power of our comprehensive evaluation approach, let's explore the image generation leaderboard. For each evaluation of the latest image-generating models, we capture and publish four key pieces of data to help users understand the capabilities and areas of opportunity of each model.

1) Elo ratings:

  • GPT Image 1 leads with 1069.17, followed by GPT 4.1 at 1039.62 and Recraft v3 at 1039.37

2) TrueSkill ratings:

  • GPT Image 1 again leads with 982.86, with GPT 4.1 following at 979.89
  • This indicates high expected performance for GPT Image 1 with relatively low uncertainty

3) Rank percentages:

  • GPT 4.1 achieves the top rank 60.14% of the time, followed by GPT Image 1 at 59.31%
  • This shows GPT 4.1's consistency in achieving top results, but also highlights GPT Image 1's and DALL-E 3's strong performance

4) Average rank (lower is better):

  • GPT 4.1 slightly edges out GPT Image 1 with an average rank of 1.40 vs. 1.41
  • This suggests GPT 4.1 performs well in direct comparisons despite its lower Elo and TrueSkill ratings

These metrics provide a multi-faceted view of model performance, allowing users to understand not just which model is "best" overall, but which might be most suitable for their specific use case. For instance, while GPT Image 1 and GPT 4.1 lead in most metrics, DALL-E 3's and Imagen 3's strong average ratings suggest they are reliable choices for consistent performance across a range of tasks.
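
For readers who want to see how this kind of multi-metric view falls out of the underlying comparisons, here is a minimal sketch that computes top-rank percentage and average rank from per-prompt rankings. The rankings and model names are hypothetical placeholders, not our published evaluation data.

```python
from collections import defaultdict

# Hypothetical per-prompt rankings from expert raters, best to worst.
# These placeholders are for illustration only, not published results.
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
    ["model_b", "model_c", "model_a"],
]

rank_counts = defaultdict(lambda: defaultdict(int))  # model -> rank -> count
rank_sums = defaultdict(int)                         # model -> sum of ranks
appearances = defaultdict(int)                       # model -> rankings it appears in

for ranking in rankings:
    for position, model in enumerate(ranking, start=1):
        rank_counts[model][position] += 1
        rank_sums[model] += position
        appearances[model] += 1

for model in sorted(appearances):
    top_rank_pct = 100.0 * rank_counts[model][1] / appearances[model]
    avg_rank = rank_sums[model] / appearances[model]
    print(f"{model}: top rank {top_rank_pct:.1f}% of the time, average rank {avg_rank:.2f}")
```

Because Elo rewards wins against strong opponents more heavily, while average rank weights every comparison equally, the two metrics can legitimately disagree, as they do for GPT Image 1 and GPT 4.1 above.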

Join the revolution: Beyond the benchmark

The Labelbox Leaderboards represent a significant advance in AI evaluation, pushing past traditional leaderboards by incorporating expert human evaluations of subjective generative AI model outputs with comprehensive metrics. We are uniquely able to achieve this thanks to our modern AI data factory, which combines human experts with our scalable platform and years of operational excellence in evaluating AI models.

We invite you to:

  • Check out the Labelbox leaderboards to explore our latest evaluations across various AI modalities and niche applications.
  • Let us know if you have suggestions or want a specific model included in future assessments.
  • Contact us to learn more about how we can help you evaluate and improve your AI models across all modalities.

Ready to go beyond the benchmark? Let's redefine AI evaluation together and drive the field toward more meaningful, human-aligned progress that truly captures the capabilities of next-generation AI models.