New cutting-edge audio models set to transform the AI landscape
Problem
A growing generative AI audio startup sought to improve their voice, speech, and sound models through expert data training. However, they faced the challenge of labeling a large volume of audio data to improve their model’s temporal understanding, a task complicated by its high subjectivity. They needed a specialized group of AI trainers skilled in voice and speech to accurately identify nuanced audio.
Solution
Labelbox’s data factory combined advanced software for labeling complex audio data with a specially sourced team of trainers from fine arts backgrounds, including theater, performing arts, and voice acting. The experts used the Labelbox Platform’s custom audio editor to label complex audio segments down to the millisecond.
Result
With an efficient workflow in place, the audio-focused startup was able to quickly obtain high-quality, differentiated datasets describing emotionally charged or uniquely styled audio segments. They can now effectively train their models to produce realistic audio content, increasing the adoption of their cutting-edge AI audio technology.

Introduction
Audio tasks are rapidly becoming critical in the evolution of AI, as voice interfaces and audio-driven insights transform how humans interact with technology. From improving text-to-speech systems to discerning speaker intent to enabling audio translation, training AI with high-quality data to understand and generate nuanced audio will differentiate the next wave of leading frontier models.
A fast-growing generative AI startup that develops audio AI models wanted to enhance their voice, text-to-speech, and sound capabilities. To do that, they needed to evaluate and label large volumes of subjective, temporal audio data. Specifically, the labels had to be framed as commands so the models could be trained to recognize sentiment and emotion in human speech.
However, this was no easy feat. Labeling subjective audio data is challenging: interpretations of emotional cues and speech patterns vary from person to person, situational context shapes how emotion is expressed, mixed emotional expressions are inherently ambiguous, and human bias can creep into judgments.
This, combined with the need for high-precision segmentation, posed a significant hurdle for the company. Lacking both the necessary expertise at scale and the right tools for labeling complex audio data, they turned to the on-demand services and software of the Labelbox data factory to handle the task.
Accurately annotating nuanced audio data
The company provided a large dataset of audio files that needed to be listened to and annotated to identify "interesting" segments—defined by high emotion (anger, happiness, disgust) or special speech styles (sarcasm, slurring, whining). Once identified, experts described how these segments were spoken, writing the descriptions as commands to guide the replication of speech patterns, tone, emotion, and style.
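As an illustration of what such a label might look like, here is a minimal sketch of a segment annotation record. The schema, field names, and example values are hypothetical, not the company's actual ontology: each record pairs millisecond timestamps with an emotion, an optional speech style, and a command-style description.

```python
from dataclasses import dataclass


@dataclass
class AudioSegmentAnnotation:
    """One labeled 'interesting' segment (hypothetical schema)."""
    start_ms: int   # segment start, millisecond precision
    end_ms: int     # segment end, millisecond precision
    emotion: str    # e.g. "anger", "happiness", "disgust"
    style: str      # e.g. "sarcasm", "slurring", "whining"
    command: str    # instruction-style description for the model

    def __post_init__(self):
        if self.end_ms <= self.start_ms:
            raise ValueError("segment must have positive duration")

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Example: a sarcastic, angry remark between 12.340 s and 14.875 s
label = AudioSegmentAnnotation(
    start_ms=12_340,
    end_ms=14_875,
    emotion="anger",
    style="sarcasm",
    command="Speak with clipped, sarcastic anger; stress the final word.",
)
print(label.duration_ms)  # 2535
```

Framing the description as an imperative command, rather than a passive observation, mirrors how the label is ultimately consumed: as a generation instruction for the model.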
Labelbox’s Labeling Services, powered by our Alignerr network of highly skilled human talent, quickly assembled and onboarded a team of experts with backgrounds in theater, performing arts, and voice acting. We found that experts with these backgrounds were particularly adept at identifying shifts in emotion and providing detailed descriptions of the emotions conveyed in a voice.
“Through years of performing and teaching the arts, I've developed a deep understanding of voice dynamics. I have mental checklists for how and where voices change, which makes it natural for me to identify the various emotions in speech and understand their impact on the listener." – Jeff K., PhD in Theatre and Performance Studies, MFA in Dance
Additionally, Labelbox’s dedicated audio editor, with built-in features like audio waveforms, custom ontologies with temporal classifications, and millisecond-level timestamps, let experts segment and label the data with the precision the task required. The team also used the editor’s new auto-transcription feature, powered by the Whisper model, which transcribes audio segments automatically with just a few clicks.
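To make the millisecond-level precision concrete, here is a small sketch of what slicing an audio segment by millisecond timestamps involves, using Python's standard-library `wave` module. This is an illustrative helper, not Labelbox's implementation; the function name and the synthetic 16 kHz clip are assumptions for the demo.

```python
import io
import wave


def extract_segment(wav_bytes: bytes, start_ms: int, end_ms: int) -> bytes:
    """Return the raw PCM frames between start_ms and end_ms (hypothetical helper)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        # Convert millisecond timestamps to frame indices at the file's sample rate
        start_frame = int(rate * start_ms / 1000)
        end_frame = int(rate * end_ms / 1000)
        wav.setpos(start_frame)
        return wav.readframes(end_frame - start_frame)


# Build a 1-second silent 16 kHz mono clip to demonstrate
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 16-bit samples
    wav.setframerate(16_000)
    wav.writeframes(b"\x00\x00" * 16_000)

segment = extract_segment(buf.getvalue(), start_ms=250, end_ms=500)
print(len(segment) // 2)  # 4000 samples = 250 ms at 16 kHz
```

At 16 kHz, one millisecond spans 16 frames, so millisecond timestamps map cleanly onto frame offsets; finer-grained boundaries than that would require sample-level indexing.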
Advancing the frontiers of AI audio
After evaluating competing data providers, the company chose Labelbox for our ability to rapidly assemble a team of experts who could analyze audio samples and generate high-quality audio data. Using our platform, the trainers could label the data granularly through custom ontologies, waveforms, and timestamps.
These leaders in the audio space now rely on the Labelbox data factory to drive innovation in the competitive AI market.