
A comprehensive approach to evaluating text-to-video models

The emergence of text-to-video AI models has marked a significant milestone in artificial intelligence, with models from Runway ML (Gen-3), Luma Labs, and Pika transforming written descriptions into dynamic and lifelike videos. This technology is reshaping industries from video production to digital marketing, democratizing visual storytelling.

However, despite their impressive capabilities, these models often fall short of human expectations, producing results that lack prompt adherence, realism, or fidelity to the input text. To accelerate the development of text-to-video models, it is crucial to establish comprehensive evaluation methodologies to pinpoint areas for improvement.

This article presents a rigorous approach to assessing the strengths and limitations of Runway ML (Gen-3), Luma Labs, and Pika using human preference ratings. Let’s dive into how we systematically analyzed these leading models.


Human preference evaluation

To capture subjective human preference, we set up the evaluation project with our specialized, highly vetted Alignerr workforce, with consensus set to three human raters per prompt. This let us tap into a network of expert raters to evaluate video outputs across several key criteria:

Checklist for evaluating text-to-video models using human preferences.
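To make the setup concrete, below is a minimal sketch of how three ratings per prompt can be reduced to a single consensus label with a simple majority vote. The record layout and the majority_vote helper are illustrative assumptions for this post, not Labelbox's actual project schema.

```python
from collections import Counter

# Hypothetical record for one (prompt, model) pair: three raters each assign
# High/Medium/Low per criterion (illustrative schema only).
ratings = {
    "prompt_adherence": ["High", "High", "Medium"],
    "video_realism": ["Medium", "Medium", "Low"],
    "video_resolution": ["High", "Medium", "High"],
    "artifacts": ["Low", "Low", "Low"],
}

def majority_vote(labels):
    """Consensus label: the rating chosen by most of the three raters."""
    return Counter(labels).most_common(1)[0][0]

consensus = {criterion: majority_vote(labels) for criterion, labels in ratings.items()}
print(consensus)
# {'prompt_adherence': 'High', 'video_realism': 'Medium',
#  'video_resolution': 'High', 'artifacts': 'Low'}
```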

Prompt adherence

Raters assessed how well each generated video matched the given text prompt on a scale of high, medium, or low. For example, for the prompt “A peaceful Zen garden with carefully raked sand, bonsai trees, and a small koi pond,” raters judged adherence by checking for the presence of the prompt's key concepts:

  • Is there a garden?
  • Does it look peaceful?
  • Is the sand present, and is it raked?
  • Are there bonsai trees?
  • Is there a small koi pond?

Scoring

  • High: All or most of the key concepts are present.
  • Medium: About half of the key concepts are present.
  • Low: Fewer than half of the key concepts are present.
Example prompt 1: "A romantic Parisian street scene with couples walking and street musicians playing".
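As a rough illustration of this rubric, the fraction of key concepts a rater finds can be mapped to a High/Medium/Low rating. The thresholds below are assumptions for the sketch; in practice raters apply the checklist by judgment rather than a fixed formula.

```python
def adherence_score(concepts_present, concepts_total):
    """Map the fraction of key concepts found in the video to a rating.
    Thresholds are illustrative assumptions based on the rubric above."""
    fraction = concepts_present / concepts_total
    if fraction >= 0.75:   # all or most key concepts present
        return "High"
    if fraction >= 0.5:    # about half of the key concepts present
        return "Medium"
    return "Low"           # fewer than half present

# Zen garden prompt: 5 key concepts (garden, peaceful mood, raked sand,
# bonsai trees, koi pond); suppose a rater finds 4 of them.
print(adherence_score(4, 5))  # -> High
```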

Video realism

Raters assessed how closely the video resembled reality.

Scoring

  • High: Realistic lighting, textures, and proportions.
  • Medium: Somewhat realistic but with slight issues in shadows or textures.
  • Low: Animated or artificial appearance.
Example prompt 2: "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

Video resolution

This criterion evaluates the level of detail and overall clarity in the video.

Scoring

  • High: Fine details visible (e.g., individual leaves, fabric textures).
  • Medium: Major elements are clear, but finer details are somewhat lacking.
  • Low: Overall blurry or lacking significant detail.
Example prompt 3: "An ancient temple in the jungle with hidden traps and treasures waiting to be discovered".

Artifacts

Raters identified any visible artifacts, distortions, or errors, such as:

  • Unnatural distortions in objects or backgrounds
  • Misplaced or floating elements
  • Inconsistent lighting or shadows
  • Unnatural repeating patterns
  • Unnatural movements
  • Blurred or pixelated areas

Scoring

  • High: Five or more of the error types are present.
  • Medium: Two to four error types are present.
  • Low: One or no error types are present.
Example prompt 4: "Cat following a mouse".
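Note that this criterion runs in the opposite direction from the others: counting more of the error types above yields a higher (i.e., worse) artifact score. A minimal sketch of that mapping, using the thresholds from the rubric:

```python
ARTIFACT_TYPES = {
    "unnatural distortions", "misplaced or floating elements",
    "inconsistent lighting or shadows", "unnatural repeating patterns",
    "unnatural movements", "blurred or pixelated areas",
}

def artifact_score(errors_found):
    """More distinct error types observed -> higher (worse) artifact score."""
    n = len(set(errors_found) & ARTIFACT_TYPES)
    if n >= 5:
        return "High"    # five or more error types present
    if n >= 2:
        return "Medium"  # two to four error types present
    return "Low"         # one or no error types present

print(artifact_score({"unnatural movements", "blurred or pixelated areas"}))  # -> Medium
```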

Results

Our evaluation stress-tested Runway ML (Gen-3), Luma Labs, and Pika on a diverse set of 25 complex prompts generated by GPT-4. Each prompt was assessed by three different raters to ensure more accurate and diverse perspectives. This rigorous stress testing provided valuable insight into each model's strengths and areas for improvement.

Let's delve into the insights and performance metrics of the top models: Runway ML (Gen-3), Luma Labs, and Pika.

Overall ranking for human-preference evaluations

  • Rank 1: Runway ML (Gen-3) ranked 1st in 65.22% of cases, Luma Labs in 18.84%, and Pika in 15.94%.
  • Rank 2: Luma Labs ranked 2nd in 59.42% of cases, Runway ML (Gen-3) in 21.74%, and Pika in 18.84%.
  • Rank 3: Pika ranked 3rd in 65.22% of cases, Luma Labs in 21.74%, and Runway ML (Gen-3) in 13.04%.
Rank 1 standings: Runway ML (Gen-3), Luma Labs, Pika
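For context, the rank percentages above can be computed by tallying per-case rankings. The sketch below assumes each case is one (prompt, rater) pair ordering the three models from best to worst; the sample data is hypothetical, not the study's raw ratings.

```python
from collections import Counter, defaultdict

# Hypothetical cases: each is one (prompt, rater) ranking, best to worst.
cases = [
    ["Runway ML (Gen-3)", "Luma Labs", "Pika"],
    ["Runway ML (Gen-3)", "Pika", "Luma Labs"],
    ["Luma Labs", "Runway ML (Gen-3)", "Pika"],
]

rank_counts = defaultdict(Counter)  # rank position -> model -> count
for ordering in cases:
    for position, model in enumerate(ordering, start=1):
        rank_counts[position][model] += 1

# Share of cases in which each model landed at each rank.
for position in sorted(rank_counts):
    total = sum(rank_counts[position].values())
    shares = {model: f"{100 * count / total:.2f}%"
              for model, count in rank_counts[position].items()}
    print(f"Rank {position}: {shares}")
```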

Model-specific performance results

Let's understand the relative strengths and weaknesses of each model.

Runway ML (Gen-3)

  • Prompt adherence: High in 59.42% of cases. This model excels at accurately reflecting the input prompts, making it a reliable choice for generating intended content.
  • Artifacts and errors: High in 17.39% of cases. Although it performs well relative to other models, it still has room for improvement in minimizing artifacts and errors.
  • Video realism: High in 46.38% of cases. Runway ML generates realistic videos nearly half the time, indicating strong capabilities in producing lifelike content.
  • Video resolution: High in 56.52% of cases. This model is proficient in delivering high-resolution videos, enhancing the viewing experience.

Runway ML (Gen-3): "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

Luma Labs

  • Prompt adherence: High in 37.68% of cases. While not as consistent as Runway ML, it still performs reasonably well in adhering to prompts.
  • Artifacts and errors: High in 17.39% of cases. Similar to Runway ML, it needs improvements to reduce visual defects.
  • Video realism: High in 20.29% of cases. This model struggles more with realism, making it less suitable for applications requiring lifelike video content.
  • Video resolution: High in 30.43% of cases. Luma Labs offers moderate video resolution quality but lags behind Runway ML.

Luma Labs: "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

Pika

  • Prompt adherence: High in 36.23% of cases. Comparable to Luma Labs, Pika maintains a fair level of consistency with input prompts.
  • Artifacts and errors: High in 43.48% of cases. Pika has the highest occurrence of artifacts and errors, indicating significant areas for enhancement.
  • Video realism: High in 20.29% of cases. Like Luma Labs, Pika also faces challenges in producing realistic videos.
  • Video resolution: High in 23.19% of cases. Pika offers the least in terms of video resolution among the three models, suggesting a need for improvement in this aspect.

Pika: "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

It's worth noting that the scope of this study was constrained by two key factors:

  • The absence of a public API for large-scale video generation from prompts;
  • Our deliberate use of a diverse prompt dataset.

This dataset encompassed a wide range of complexity, from simple to intricate descriptions. We also attempted automatic evaluations, such as assessing video quality across all video frames and judging outputs with large language models (LLMs) that support video; however, due to conflicting results, these methods were omitted from this post.
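For illustration, one example of the kind of automatic check we tried is a simple frame-level sharpness score: sample frames from the generated video and take the variance of the Laplacian, where higher values loosely indicate sharper footage. The sketch below uses OpenCV and a hypothetical file path, and is illustrative only, given the conflicting results mentioned above.

```python
import cv2  # pip install opencv-python

def mean_sharpness(video_path, sample_every=10):
    """Average variance-of-Laplacian over sampled frames; higher ~ sharper."""
    cap = cv2.VideoCapture(video_path)
    scores, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            scores.append(cv2.Laplacian(gray, cv2.CV_64F).var())
        index += 1
    cap.release()
    return sum(scores) / len(scores) if scores else 0.0

print(mean_sharpness("generated_video.mp4"))  # hypothetical path
```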

Conclusion

Our comprehensive evaluation of state-of-the-art text-to-video models reveals a clear preference hierarchy among Runway ML (Gen-3), Luma Labs, and Pika. Runway ML (Gen-3) emerges as the top performer, securing the first rank in 65.22% of cases, thanks to its high prompt adherence and superior video resolution. However, it still exhibits a notable occurrence of artifacts and errors, suggesting room for enhancement.

Luma Labs, while trailing behind Runway ML, demonstrates moderate performance, particularly in maintaining prompt consistency and video resolution. Its primary weakness lies in generating realistic videos, which is crucial for lifelike content applications. On the other hand, Pika, ranking third, shows the highest need for improvement, especially in minimizing artifacts and enhancing video resolution.

While each model has its strengths and weaknesses, Runway ML (Gen-3) stands out for its robust performance across most evaluation criteria, making it the preferred choice for generating high-quality, realistic videos. As the field of text-to-video generation continues to evolve, addressing the identified shortcomings will be key to advancing the capabilities of these models.

By targeting these key areas, we can drive the next wave of innovations in text-to-video technology, creating more sophisticated and versatile text-to-video systems that cater to a broader range of applications and user needs.


Get started today

The comprehensive approach to evaluating text-to-video models presented here represents a significant advance in assessing AI-generated videos. Combined with Labelbox's platform, it helps AI teams accelerate the development and refinement of sophisticated, domain-specific video generation models with greater efficiency and quality through our dataset curation, automated evaluation techniques, and human-in-the-loop QA.

If you're interested in implementing this evaluation approach or leveraging Labelbox's tools for your own text-to-video model assessment, sign up for a free Labelbox account to try it out, or contact us to learn more.

We'd love to hear from you and discuss how we can support your AI evaluation needs.