How a leading text-to-speech AI lab harnesses human-generated data to deliver generative AI faster

A text-to-speech lab translates videos while preserving the original voice and emotion. Labelbox's platform produced the expert-graded transcription signal used for post-training, over 3x more accurate than other solutions on the market.

The challenge

A text-to-speech company builds AI that converts videos and podcasts into other languages, preserving not just the words but the original voice and emotion, so creators and businesses can publish across languages without losing the feel. That capability rests on generative AI and post-training. Improving the models required high-quality, human-generated audio transcription signal, delivered at the pace of product development. Producing it in-house meant coordinating internal and external experts, building dedicated tooling, and grading complex audio against existing model-generated transcriptions, all under tight, urgent timelines.

The approach

The company adopted Labelbox for control and transparency over its data pipeline: granular visibility and built-in quality assurance to spot-check the areas that mattered. Drawing on Labelbox's Alignerr network of experts, which spans major languages and advanced subjects, the platform produced expert-graded speech signal that captured nuances like pitch, accent, pace, and pronunciation. The company iterated with Labelbox on the quality bar for both signal generation and model evaluation, and had the flexibility to use internal teams, Labelbox's expert network, or both to produce the best signal for each post-training task.

The outcome

The company improved the accuracy of its human-generated transcription signal by over 3x compared to other solutions on the market, and compressed new model development from months to weeks. It's now scaling its post-training signal with Labelbox and its global network of specialized experts.

Where this goes

Voice is a frontier modality, and emotion is the hard part. Expert-graded speech signal is what lets a model translate not just words, but how something was said.
