How a leading text-to-speech AI lab harnesses human-generated data to deliver generative AI faster
Problem
A leading text-to-speech company needed to improve their AI models by sourcing new, high-quality human-generated data. They had two key challenges: increase the scale of data to accelerate project timelines and source the data quickly to minimize data bottlenecks during post-training. To deliver the necessary text-to-speech audio transcriptions, they required dedicated tooling and labeling teams to meet narrow development timeframes.
Solution
The company turned to the Labelbox data labeling platform, which offered them full control over their data pipeline and real-time visibility into labeling quality to support quality assurance (QA). They also used the Labelbox Labeling Services (powered by Alignerr) to improve their text-to-speech data quality by sourcing highly skilled labelers who could identify nuances in speech such as pitch, accents, pace, and pronunciation.
Result
The human-generated transcription data powered by Alignerr improved the company's data accuracy by over 3x compared to their previous data. They also shortened new model development from months to weeks, thanks to the Labelbox Labeling Services’ ability to deliver specialized, on-demand labeling for their text-to-speech development.
As a developer of groundbreaking text-to-speech technology, the company’s mission is to build innovative AI products that offer an easy, automatic way to convert videos and podcasts into other languages. The company was built on the idea that users wanted not just to convert speech from one language to another but also to preserve the original voice and emotions of the audio. Translating more than just the words lets businesses and creators rapidly publish their videos in other languages without losing the original feel and emotions of the audio. All this became possible with new generative AI technology and powerful post-training methods.
To build and improve their text-to-speech AI models, the company was searching for solutions to help them meet their increasing demand for human-generated audio transcription data. They needed turnaround times that matched their product development speeds. However, generating and labeling all of this data internally was a complex operation that involved multiple layers of coordination and collaboration between internal labeling resources and external labeling services. Tight collaboration was essential to ensuring the highest quality data.
The complexity and labor-intensive nature of data labeling posed a significant challenge for the company. They had to build and manage dedicated tooling and labeling teams to meet narrow timeframes during their calibration and production stages. Furthermore, the company had to handle complex audio transcription tasks in which audio files were annotated alongside existing transcriptions that their models had generated, all with a rapid turnaround time.
To address these challenges, the company adopted Labelbox’s data labeling platform, which offered them high levels of control and transparency over their data pipeline. The platform provided granular visibility and built-in quality assurance capabilities that allowed them to spot-check specific areas of interest. In addition, the company tapped into the Labelbox Labeling Services to improve their text-to-speech data quality by highlighting nuances in speech such as pitch, accents, pace, and pronunciation.
Labelbox’s Labeling Services are powered by Alignerr, a community of highly skilled labelers from around the world whose expertise spans all major languages as well as a diverse range of advanced subjects. To ensure a streamlined AI data factory, the company worked closely with Labelbox's team to iterate on the quality bar needed for both data annotation and model evaluation. They benefited from the flexibility to experiment with their own internal teams, external expert labelers, and a hybrid of both to determine the ideal combination for creating the best data for each post-training task.
In terms of results, the GenAI company has improved the accuracy of their human-generated transcription data by over 3x compared to other solutions on the market. They sped up their new model development from months to weeks using the Labelbox Labeling Services, which delivered labeling experts ready to support their text-to-speech development. As a next step, the company is scaling their post-training data labeling efforts with Labelbox and its global network of specialized human raters.