How a leading text-to-speech AI lab harnesses human-generated data to deliver generative AI faster

The company develops groundbreaking text-to-speech technology, and its mission is to build innovative AI products that offer an easy, automatic way to convert videos and podcasts into other languages. It was built on the idea that users want not just to convert speech from one language to another but also to preserve the original voice and emotions of the audio. Translating more than just the words lets businesses and creators rapidly publish their videos in other languages without losing the feel of the original. All this became possible with new generative AI technology and powerful post-training methods.

To build and improve their text-to-speech AI models, the company needed a way to meet their growing demand for human-generated audio transcription data, with turnaround times that matched the pace of their product development. However, generating and labeling all of this data internally was a complex operation that required multiple layers of coordination between internal labeling resources and external labeling services. Tight collaboration was essential to ensuring the highest quality data.

The complexity and labor-intensive nature of data labeling posed a significant challenge for the company. They had to build and manage dedicated tooling and labeling teams to meet narrow timeframes during their calibration and production stages. Furthermore, the company had to handle complex audio transcription tasks in which audio files were annotated alongside the existing transcriptions that their models generated, all with a rapid turnaround.

To address these challenges, the company adopted Labelbox’s data labeling platform, which offered them high levels of control and transparency over their data pipeline. The platform provided granular visibility and built-in quality assurance capabilities that allowed them to spot-check specific areas of interest. In addition, the company tapped into Labelbox Labeling Services to improve their text-to-speech data quality by highlighting nuances in speech such as pitch, accent, pace, and pronunciation.

Labelbox’s Labeling Services is powered by the Alignerr community, a group of highly skilled expert labelers from around the world that spans all major languages as well as a diverse range of advanced subjects. To ensure a streamlined AI data factory, the company worked closely with Labelbox's team to iterate on the quality bar needed for both data annotation and model evaluation. They benefited from the flexibility to experiment with their own internal teams, external expert labelers, and a hybrid of both to determine the ideal combination for creating the best data for each post-training task.

In terms of results, the GenAI company is now able to improve the accuracy of their human-generated transcription data by over 3x compared to other solutions on the market. They shortened new model development from months to weeks using the Labelbox Labeling Services, which supplied labeling experts ready to support their text-to-speech development. As a next step, the company is scaling their post-training data labeling efforts with Labelbox and its global network of specialized human raters.

