
A comprehensive approach to evaluating text-to-video models

The emergence of text-to-video AI models has marked a significant milestone in artificial intelligence, with models from Runway ML (Gen-3), Luma Labs, and Pika transforming written descriptions into dynamic and lifelike videos. This technology is reshaping industries from video production to digital marketing, democratizing visual storytelling.

However, despite their impressive capabilities, these models often fall short of human expectations, producing results that lack prompt adherence, realism, or fidelity to the input text. To accelerate the development of text-to-video models, it is crucial to establish comprehensive evaluation methodologies to pinpoint areas for improvement.

This article presents a rigorous approach to assessing the strengths and limitations of Runway ML (Gen-3), Luma Labs, and Pika using human preference ratings. Let’s dive into how we systematically analyzed these leading models.


Human preference evaluation

To capture subjective human preference, we set up the evaluation project with our specialized, highly vetted Alignerr workforce, with consensus set to three human raters per prompt. This let us tap into a network of expert raters to evaluate video outputs across several key criteria:

Checklist for evaluating text-to-video models using human preferences.
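To make the setup concrete, below is a minimal sketch of how three ratings per prompt can be reduced to a single consensus label with a simple majority vote. The record layout and the majority_vote helper are illustrative assumptions for this post, not Labelbox's actual project schema.

```python
from collections import Counter

# Hypothetical record for one (prompt, model) pair: three raters each assign
# High/Medium/Low per criterion (illustrative schema only).
ratings = {
    "prompt_adherence": ["High", "High", "Medium"],
    "video_realism": ["Medium", "Medium", "Low"],
    "video_resolution": ["High", "Medium", "High"],
    "artifacts": ["Low", "Low", "Low"],
}

def majority_vote(labels):
    """Consensus label: the rating chosen by most of the three raters."""
    return Counter(labels).most_common(1)[0][0]

consensus = {criterion: majority_vote(labels) for criterion, labels in ratings.items()}
print(consensus)
# {'prompt_adherence': 'High', 'video_realism': 'Medium',
#  'video_resolution': 'High', 'artifacts': 'Low'}
```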

Prompt adherence

Raters assessed how well each generated video matched the given text prompt on a scale of high, medium, or low. For example, for the prompt “A peaceful Zen garden with carefully raked sand, bonsai trees, and a small koi pond,” raters judged adherence by checking for the presence of the prompt's key concepts:

  • Is there a garden?
  • Does it look peaceful?
  • Is the sand present, and is it raked?
  • Are there bonsai trees?
  • Is there a small koi pond?

Scoring

  • High: All or most of the key concepts are present.
  • Medium: About half of the key concepts are present.
  • Low: Fewer than half of the key concepts are present.
Example prompt 1: "A romantic Parisian street scene with couples walking and street musicians playing".
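As a rough illustration of this rubric, the fraction of key concepts a rater finds can be mapped to a High/Medium/Low rating. The thresholds below are assumptions for the sketch; in practice raters apply the checklist by judgment rather than a fixed formula.

```python
def adherence_score(concepts_present, concepts_total):
    """Map the fraction of key concepts found in the video to a rating.
    Thresholds are illustrative assumptions based on the rubric above."""
    fraction = concepts_present / concepts_total
    if fraction >= 0.75:   # all or most key concepts present
        return "High"
    if fraction >= 0.5:    # about half of the key concepts present
        return "Medium"
    return "Low"           # fewer than half present

# Zen garden prompt: 5 key concepts (garden, peaceful mood, raked sand,
# bonsai trees, koi pond); suppose a rater finds 4 of them.
print(adherence_score(4, 5))  # -> High
```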

Video realism

Raters assessed how closely the video resembled reality.

Scoring

  • High: Realistic lighting, textures, and proportions.
  • Medium: Somewhat realistic but with slight issues in shadows or textures.
  • Low: Animated or artificial appearance.
Example prompt 2: "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

Video resolution

This criterion evaluates the level of detail and overall clarity in the video.

Scoring

  • High: Fine details visible (e.g., individual leaves, fabric textures).
  • Medium: Major elements are clear, but finer details are somewhat lacking.
  • Low: Overall blurry or lacking significant detail.
Example prompt 3: "An ancient temple in the jungle with hidden traps and treasures waiting to be discovered".

Artifacts

Raters identified any visible artifacts, distortions, or errors, such as:

  • Unnatural distortions in objects or backgrounds
  • Misplaced or floating elements
  • Inconsistent lighting or shadows
  • Unnatural repeating patterns
  • Unnatural movements
  • Blurred or pixelated areas

Scoring

  • High: Five or more of the error types are present.
  • Medium: Two to four error types are present.
  • Low: One or no error types are present.
Example prompt 4: "Cat following a mouse".
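Note that this criterion runs in the opposite direction from the others: counting more of the error types above yields a higher (i.e., worse) artifact score. A minimal sketch of that mapping, using the thresholds from the rubric:

```python
ARTIFACT_TYPES = {
    "unnatural distortions", "misplaced or floating elements",
    "inconsistent lighting or shadows", "unnatural repeating patterns",
    "unnatural movements", "blurred or pixelated areas",
}

def artifact_score(errors_found):
    """More distinct error types observed -> higher (worse) artifact score."""
    n = len(set(errors_found) & ARTIFACT_TYPES)
    if n >= 5:
        return "High"    # five or more error types present
    if n >= 2:
        return "Medium"  # two to four error types present
    return "Low"         # one or no error types present

print(artifact_score({"unnatural movements", "blurred or pixelated areas"}))  # -> Medium
```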

Results

Our evaluation stress-tested Runway ML (Gen-3), Luma Labs, and Pika on a diverse set of 25 complex prompts generated by GPT-4. Each prompt was assessed by three different raters to ensure more accurate and diverse perspectives. This rigorous stress testing provided valuable insight into each model's strengths and areas for improvement.

Let's delve into the insights and performance metrics of the top models: Runway ML (Gen-3), Luma Labs, and Pika.

Overall ranking for human-preference evaluations

  • Rank 1: Runway ML (Gen-3) ranked 1st in 65.22% of cases, Luma Labs in 18.84%, and Pika in 15.94%.
  • Rank 2: Luma Labs ranked 2nd in 59.42% of cases, Runway ML (Gen-3) in 21.74%, and Pika in 18.84%.
  • Rank 3: Pika ranked 3rd in 65.22% of cases, Luma Labs in 21.74%, and Runway ML (Gen-3) in 13.04%.
Rank 1 standings: Runway ML (Gen-3), Luma Labs, Pika
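For context, the rank percentages above can be computed by tallying per-case rankings. The sketch below assumes each case is one (prompt, rater) pair ordering the three models from best to worst; the sample data is hypothetical, not the study's raw ratings.

```python
from collections import Counter, defaultdict

# Hypothetical cases: each is one (prompt, rater) ranking, best to worst.
cases = [
    ["Runway ML (Gen-3)", "Luma Labs", "Pika"],
    ["Runway ML (Gen-3)", "Pika", "Luma Labs"],
    ["Luma Labs", "Runway ML (Gen-3)", "Pika"],
]

rank_counts = defaultdict(Counter)  # rank position -> model -> count
for ordering in cases:
    for position, model in enumerate(ordering, start=1):
        rank_counts[position][model] += 1

# Share of cases in which each model landed at each rank.
for position in sorted(rank_counts):
    total = sum(rank_counts[position].values())
    shares = {model: f"{100 * count / total:.2f}%"
              for model, count in rank_counts[position].items()}
    print(f"Rank {position}: {shares}")
```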

Model-specific performance results

Let's understand the relative strengths and weaknesses of each model.

Runway ML (Gen-3)

  • Prompt adherence: High in 59.42% of cases. This model excels at accurately reflecting the input prompts, making it a reliable choice for generating intended content.
  • Artifacts and errors: High in 17.39% of cases. Although it performs well relative to other models, it still has room for improvement in minimizing artifacts and errors.
  • Video realism: High in 46.38% of cases. Runway ML generates realistic videos nearly half the time, indicating strong capabilities in producing lifelike content.
  • Video resolution: High in 56.52% of cases. This model is proficient in delivering high-resolution videos, enhancing the viewing experience.

Runway ML (Gen-3): "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

Luma Labs

  • Prompt adherence: High in 37.68% of cases. While not as consistent as Runway ML, it still performs reasonably well in adhering to prompts.
  • Artifacts and errors: High in 17.39% of cases. Similar to Runway ML, it needs improvements to reduce visual defects.
  • Video realism: High in 20.29% of cases. This model struggles more with realism, making it less suitable for applications requiring lifelike video content.
  • Video resolution: High in 30.43% of cases. Luma Labs offers moderate video resolution quality but lags behind Runway ML.

Luma Labs: "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

Pika

  • Prompt adherence: High in 36.23% of cases. Comparable to Luma Labs, Pika maintains a fair level of consistency with input prompts.
  • Artifacts and errors: High in 43.48% of cases. Pika has the highest occurrence of artifacts and errors, indicating significant areas for enhancement.
  • Video realism: High in 20.29% of cases. Like Luma Labs, Pika also faces challenges in producing realistic videos.
  • Video resolution: High in 23.19% of cases. Pika offers the least in terms of video resolution among the three models, suggesting a need for improvement in this aspect.

Pika: "A grand fantasy castle surrounded by lush landscapes and mythical creatures".

It's worth noting that the scope of this study was constrained by two key factors:

  • The absence of a public API for large-scale video generation from prompts;
  • Our deliberate use of a diverse prompt dataset.

This dataset encompassed a wide range of complexity, from simple to intricate descriptions. We also attempted automatic evaluations, such as assessing video quality across all video frames and judging outputs with large language models (LLMs) that support video; however, due to conflicting results, these methods were omitted from this post.
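For illustration, one example of the kind of automatic check we tried is a simple frame-level sharpness score: sample frames from the generated video and take the variance of the Laplacian, where higher values loosely indicate sharper footage. The sketch below uses OpenCV and a hypothetical file path, and is illustrative only, given the conflicting results mentioned above.

```python
import cv2  # pip install opencv-python

def mean_sharpness(video_path, sample_every=10):
    """Average variance-of-Laplacian over sampled frames; higher ~ sharper."""
    cap = cv2.VideoCapture(video_path)
    scores, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            scores.append(cv2.Laplacian(gray, cv2.CV_64F).var())
        index += 1
    cap.release()
    return sum(scores) / len(scores) if scores else 0.0

print(mean_sharpness("generated_video.mp4"))  # hypothetical path
```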

Conclusion

Our comprehensive evaluation of state-of-the-art text-to-video models reveals a clear preference hierarchy among Runway ML (Gen-3), Luma Labs, and Pika. Runway ML (Gen-3) emerges as the top performer, securing the first rank in 65.22% of cases, thanks to its high prompt adherence and superior video resolution. However, it still exhibits a notable occurrence of artifacts and errors, suggesting room for enhancement.

Luma Labs, while trailing behind Runway ML, demonstrates moderate performance, particularly in maintaining prompt consistency and video resolution. Its primary weakness lies in generating realistic videos, which is crucial for lifelike content applications. On the other hand, Pika, ranking third, shows the highest need for improvement, especially in minimizing artifacts and enhancing video resolution.

While each model has its strengths and weaknesses, Runway ML (Gen-3) stands out for its robust performance across most evaluation criteria, making it the preferred choice for generating high-quality, realistic videos. As the field of text-to-video generation continues to evolve, addressing the identified shortcomings will be key to advancing the capabilities of these models.

By targeting these key areas, we can drive the next wave of innovations in text-to-video technology, creating more sophisticated and versatile text-to-video systems that cater to a broader range of applications and user needs.


Get started today

The comprehensive approach to evaluating text-to-video models presented here represents a significant advance in assessing AI-generated videos. Combined with Labelbox's platform, it helps AI teams accelerate the development and refinement of sophisticated, domain-specific video generation models with greater efficiency and quality through our dataset curation, automated evaluation techniques, and human-in-the-loop QA.

If you're interested in implementing this evaluation approach or leveraging Labelbox's tools for your own text-to-video model assessment, sign up for a free Labelbox account to try it out, or contact us to learn more.

We'd love to hear from you and discuss how we can support your AI evaluation needs.