Beyond benchmarks ( Leaderboards ) v5

Diverse pool of US-based Alignerrs, including generalists and creative artists

Consensus of three Alignerrs per task

Standardized instructions and ontology for consistent evaluations

Carefully curated prompt generation process, balancing creativity and clarity

Image

Imagen 3

DALL·E 3

Flux 1.1 Pro

Stable Diffusion 3

Ideogram 2.0

Recraft v3

Image-generation-v2

Audio

Speech-generation v2

Video

Video-generation-v2

Multimodal reasoning

Complex reasoning

Assesses a text-to-speech model’s ability to understand contextual information throughout the text and adapt its output based on linguistic and situational context. Examples include tone adjustment, changes in emphasis and rhythm, and punctuation interpretation.

Measures how correctly a speech synthesis or recognition system produces or identifies the sounds of a language. It assesses the model's ability to generate or recognize proper phonemes, stress patterns, and intonation according to the rules and norms of the target language.

Measures factors such as prosody, rhythm, and intonation to determine whether the generated speech closely mimics natural human speech patterns, avoiding robotic or unnatural-sounding output.

Assess your overall satisfaction with the generated image given the input prompt.

Evaluate how well the input prompt is represented in the output image content.

Judge how visually appealing the generated images are, regardless of the requested content.

Assess your overall satisfaction with the generated video given the input prompt.

Evaluate how well the input prompt is represented in the output video content.

Measures whether the video looks like genuine footage, with realistic movements, lighting, and details, as opposed to appearing artificial or computer-generated.

What is “Elo Rating”?

Elo is a dynamic rating system originally used in competitive games to rank players; here it is applied to models. A higher Elo rating indicates better performance in head-to-head comparisons. Ratings are adjusted after each match according to how the models perform against each other, and the K-factor (32) determines how much a rating can change after each match.
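The Elo update can be sketched as follows. This is a minimal illustration that assumes the conventional Elo logistic curve (base 10, scale 400) and a hypothetical starting rating of 1000; only the K-factor of 32 comes from the text, and the leaderboard's exact procedure may differ.

```python
K_FACTOR = 32  # K-factor stated in the text


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo curve (assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = K_FACTOR):
    """Return the new (rating_a, rating_b) after one head-to-head match.

    score_a is 1.0 if A won, 0.5 for a tie, and 0.0 if A lost.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start at a hypothetical rating of 1000; model A wins the match.
new_a, new_b = update_elo(1000.0, 1000.0, 1.0)
print(new_a, new_b)  # 1016.0 984.0
```

Because the two ratings start equal, the expected score is 0.5 and the winner gains exactly half of K. An upset win against a higher-rated model would move both ratings by more than that.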

What is “Win %”?

The percentage of times a model was ranked at each position, from the top (Rank #1) to the bottom (Rank #5). A higher percentage at Rank #1 means that the model was more frequently preferred over the other models.
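A rank distribution of this kind could be computed as sketched below. The function name and the sample ranks are hypothetical; the only detail taken from the text is that ranks run from 1 (best) to 5 (worst).

```python
from collections import Counter


def rank_distribution(ranks, num_ranks=5):
    """Return {rank: percentage of tasks in which the model received that rank}."""
    if not ranks:
        raise ValueError("at least one rank is required")
    counts = Counter(ranks)
    total = len(ranks)
    return {r: 100.0 * counts.get(r, 0) / total for r in range(1, num_ranks + 1)}


# Hypothetical ranks one model received across ten tasks.
example_ranks = [1, 1, 2, 3, 1, 5, 2, 1, 4, 2]
dist = rank_distribution(example_ranks)
print(dist[1])  # 40.0, i.e. ranked first in 40% of tasks
```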

What is “TrueSkill Rating”?

Another rating system, primarily used in games like Halo. It models both a model's expected performance (the mean, denoted by `mu`) and the uncertainty around that estimate (conventionally denoted `sigma`). Here, we're primarily interested in `mu`, with higher values indicating better performance.
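To illustrate how `mu` and the uncertainty move after a comparison, here is a simplified sketch of the two-player, no-draw TrueSkill update. It is an assumption-laden illustration rather than the leaderboard's implementation: it uses the commonly cited defaults (`mu` = 25, `sigma` = 25/3, `beta` = 25/6), ignores draws and the dynamics term, and handles only one pair at a time, whereas rankings of five models would need the full multi-player algorithm. The function name `update_two_player` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

MU0 = 25.0          # commonly cited default mean (assumption)
SIGMA0 = 25.0 / 3   # commonly cited default uncertainty (assumption)
BETA = 25.0 / 6     # commonly cited default performance noise (assumption)

_STD_NORMAL = NormalDist()  # standard normal, used for pdf and cdf


def update_two_player(winner, loser, beta=BETA):
    """Simplified TrueSkill update for one win/loss result (no draws).

    winner and loser are (mu, sigma) tuples; returns the updated tuples.
    """
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = sqrt(2 * beta**2 + sigma_w**2 + sigma_l**2)
    t = (mu_w - mu_l) / c
    v = _STD_NORMAL.pdf(t) / _STD_NORMAL.cdf(t)  # mean correction factor
    w = v * (v + t)                              # variance correction factor
    new_winner = (mu_w + sigma_w**2 / c * v,
                  sigma_w * sqrt(1 - sigma_w**2 / c**2 * w))
    new_loser = (mu_l - sigma_l**2 / c * v,
                 sigma_l * sqrt(1 - sigma_l**2 / c**2 * w))
    return new_winner, new_loser


# Two previously unrated models; the first one is preferred.
(win_mu, win_sigma), (lose_mu, lose_sigma) = update_two_player(
    (MU0, SIGMA0), (MU0, SIGMA0)
)
print(round(win_mu, 2), round(lose_mu, 2))
```

The winner's `mu` rises, the loser's falls by the same amount when both start from the same prior, and both uncertainties shrink because the result adds information about each model.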

What is “Average rank”?

This represents the average ranking a model received based on user classifications. A lower average rank means the model performed better overall (e.g., Rank #1 would give a lower average than Rank #5).
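As a small illustration of the average rank, the sketch below averages the ranks each model received and orders models so that the lowest (best) average comes first. The model names and ranks are hypothetical.

```python
def average_rank(ranks):
    """Mean of the ranks a model received (lower is better)."""
    if not ranks:
        raise ValueError("at least one rank is required")
    return sum(ranks) / len(ranks)


# Hypothetical ranks assigned to two models over the same four tasks.
ranks_by_model = {
    "model_a": [1, 2, 1, 3],
    "model_b": [2, 1, 3, 2],
}
averages = {name: average_rank(r) for name, r in ranks_by_model.items()}
leaderboard = sorted(averages, key=averages.get)  # best (lowest) average first
print(averages)      # {'model_a': 1.75, 'model_b': 2.0}
print(leaderboard)   # ['model_a', 'model_b']
```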

Word Error Rate (WER) is a metric used to assess the accuracy of speech recognition models by comparing the recognized text to the original transcript. It calculates the minimum number of word insertions, deletions, and substitutions required to transform the recognized text into the reference text, divided by the total number of words in the reference.
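The WER definition above can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration that assumes whitespace tokenization and no text normalization (case and punctuation are compared as-is); production evaluations typically normalize text first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("fox" -> "box") and one missing word ("jumps"): 2 / 5.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box")
print(wer)  # 0.4
```

Note that WER can exceed 1.0 when the recognized text contains many more words than the reference, since insertions count as errors but do not increase the denominator.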
