How Speak elevates language learning with high-quality AI data

Speak, a fast-growing language learning app, has made a name for itself by harnessing AI to create personalized and immersive learning experiences for millions of users worldwide. Their mission is to break down language barriers by using cutting-edge technology to help people learn new languages quickly and effectively.

The language learning app is designed to function as a tutor, offering support in languages including English and Spanish, with more to come soon. The entire product experience is built around advanced speech recognition and large language models (LLMs), which let users interact with the app naturally. These technologies power the learning process, helping users improve their language skills through real-time feedback and guidance.

Early roots in AI

Speak was founded in 2016 and, even then, had the vision to develop an AI-powered language tutor for English learners. At the time, however, the technology wasn’t advanced enough to support their ambitions. Deep learning was just starting to gain traction, but the team anticipated the explosion of AI advancements and built their strategy around it. One of their key challenges has been dealing with the complexity of audio, which includes supporting a wide variety of accents, dialects, and phonetic corrections, and often requires labeled data for training custom speech models. Doing this at scale required vast amounts of high-quality, labeled data to train and continuously refine their AI models. This is where Labelbox came in.

The challenges of ensuring data quality at scale

For Speak, delivering accurate and nuanced language lessons meant working with diverse data sources, including audio clips, transcriptions, and video-based conversations. Their AI models needed to understand complex language dynamics such as tone, context, and regional dialects. As the user base grew, so did the volume of data. Managing and labeling this massive dataset while maintaining high-quality standards became a significant challenge.

The team consists of machine learning engineers focused on speech and language, leveraging large language models (LLMs) from OpenAI and off-the-shelf APIs integrated into a robust software stack. On the speech side, they work across various product experiences, particularly on different types of speech recognition. A key feature of their offering is the core lessons, where users repeat a target sentence and receive immediate feedback on their pronunciation. As they progress, the task becomes harder: words are blanked out, or the target comes in response to a question, turning the exercise into a Q&A format. These systems need to be highly accurate, customized, and fine-tuned.

The team was interested in using Labelbox as their primary data annotation platform, both for evaluating the accuracy of their speech systems and for training them to improve performance.
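
Evaluating a speech system against labeled data usually comes down to comparing its transcripts with human-verified references. The source does not say which metric Speak uses, so the following is only a minimal sketch using word error rate (WER), a common speech recognition metric. The function names and sample sentences are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return errors / total_words


# Hypothetical (labeled reference, model transcript) pairs.
pairs = [
    ("i would like a coffee", "i would like coffee"),
    ("where is the station", "where is the station"),
]
wer = word_error_rate(pairs)
```

Tracking a number like this on freshly labeled production data is one way an evaluation set can show whether a retrained model actually improved.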

"One of the primary things we care about is the quality of our labels, especially when it comes to streaming speech recognition. It's not just about perfect pronunciation; it's about whether the user reads the line and if their speech is understandable to an English speaker. This task involves subjectivity, so we focus on how consistently we hit that standard. Our goal is to generate ground-truth datasets where there's strong agreement on quality. At the end of the day, quality is king—it matters much more than quantity. Achieving this level of quality requires iteration and a dedicated team of the same people who can work consistently on refining it which Labelbox helps us with." – Andrew Hsu, Co-founder and CTO of Speak

Previously, the process involved a lot of manual, tedious work, with business operations relying on a handful of individual contractors and countless spreadsheets. The data pipeline was also scattered, making it difficult to manage everything effectively. Labelbox allows Speak to centralize everything in one place, with a primary focus on speech-related tasks. This includes projects on pronunciation work, specifically around phonetics and phonemics. Eventually, the goal is to use LLMs to simulate an AI language tutor that provides corrections, replicating the experience of having a private human language tutor (which could typically cost upwards of $100 per hour) but delivered through AI at scale.
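
The idea of ground-truth datasets built on strong labeler agreement, described in the quote above, can be illustrated with a small example. This is a sketch under stated assumptions, not Speak's or Labelbox's actual workflow: the label names, clip IDs, and the 0.8 agreement threshold are all hypothetical.

```python
from collections import Counter


def consensus(labels):
    """Return (majority_label, agreement_fraction) for one clip's labels."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def build_ground_truth(annotations, min_agreement=0.8):
    """Keep only clips whose labelers agree at or above the threshold."""
    ground_truth = {}
    for clip_id, labels in annotations.items():
        if not labels:
            continue  # skip clips nobody has labeled yet
        label, agreement = consensus(labels)
        if agreement >= min_agreement:
            ground_truth[clip_id] = label
    return ground_truth


# Hypothetical labels: several annotators judge whether each clip is
# understandable to an English speaker.
annotations = {
    "clip_001": ["understandable", "understandable", "understandable"],
    "clip_002": ["understandable", "not_understandable", "understandable"],
    "clip_003": ["not_understandable"] * 5,
}
ground_truth = build_ground_truth(annotations)
```

Here clip_002 is left out because only two of its three labelers agree. Clips like that are candidates for another review pass rather than for the ground-truth set.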

Finding a comprehensive data labeling platform for post-training

After exploring various solutions on the market, Speak chose Labelbox as their partner to streamline and enhance their data labeling efforts. Labelbox's intuitive and comprehensive platform offered a collaborative environment where Speak's internal team, along with external data annotators, could work together seamlessly. The platform's powerful tools for quality control, collaborative labeler feedback, and data management became crucial to Speak’s data operations. With Labelbox, Speak saw a marked improvement in their data pipeline.

"At the highest level, we now have a way to tackle something that used to be very painful—an automatic labeling loop that lets us evaluate our speech systems daily on production data. This loop allows us to continuously retrain our models, which has led to significant improvements. For example, we saw model accuracy improvements of 35% after using Labelbox, and what effectively translated into a 2x increase in speed in terms of accelerating model development." – Tobi Szuts, Machine Learning Engineer at Speak

Accelerating AI with Labelbox

Increased labeling efficiency: By integrating automation, Speak’s data labeling time was cut by nearly 50%, allowing them to release new features and languages at a faster pace.
Improved data quality: Labelbox’s real-time feedback loops and quality assurance tools helped maintain high labeling accuracy. This, in turn, enhanced the performance of Speak's AI models, improving the app's ability to recognize accents, tonal variations, and context in conversations.
Scalable operations: As Speak expands to new regions, Labelbox’s scalable and unified platform, coupled with post-training labeling services, serves as a data engine that makes it easy to onboard a diverse set of annotators and manage a growing dataset without compromising on quality.

The future of personalized language learning

Speak has now established a solid foundation for data quality assurance, with basic features in place such as handling multiple labels for both Spanish and English on the same platform. They leverage Google Cloud Platform (GCP) for much of their infrastructure. Additionally, the company has a long-standing relationship with OpenAI, integrating their models into the workflow for enhanced performance.

For Speak, partnering with Labelbox has been transformative. By improving the quality and efficiency of their data labeling, Speak’s AI models are now more capable of delivering nuanced and personalized language learning experiences. With Labelbox, Speak continues to push the boundaries of what's possible in language education, helping users around the world become fluent in new languages faster and more effectively.