
Daily evaluation signal that retrains Speak's speech models
Problem
Delivering accurate, nuanced language lessons meant working with diverse data — audio clips, transcriptions, video conversations — and models that understand tone, context, and regional dialect. As Speak's user base grew, so did the data. Producing high-quality signal at that scale became the challenge.
Solution
Speak chose Labelbox to produce the signal that trains and evaluates its speech systems. The platform centralized data operations — quality control, collaborative feedback, and data management in one place — and, with post-training services, became the data engine behind Speak's speech models.
Result
Speak cut its signal-production time by nearly 50%, releasing new features and languages faster, and saw model accuracy improvements of up to 35%.

Speak's AI language tutor runs on speech recognition and LLMs. Labelbox's platform produces the expert-graded signal and the daily evaluation loop that retrains those models on production data.
The challenge
Speak is a fast-growing language learning app that uses AI to create personalized, immersive learning for millions of users worldwide, in English and Spanish with more languages coming. The product works like a tutor, built on speech recognition and large language models that give real-time feedback. Founded in 2016 with the vision of an AI language tutor, Speak bet on AI before the technology could support it. The hard part is audio: the models have to handle a wide variety of accents and dialects and give phonetic corrections, working across audio clips, transcriptions, and video conversations while accounting for tone, context, and regional dialect. As the user base grew, so did the data, and producing high-quality signal at that scale became the challenge.
The approach
Speak chose Labelbox to produce the signal that trains and evaluates its speech systems. The platform centralized work that used to depend on individual contractors and scattered spreadsheets, with quality control, collaborative feedback, and data management in one place. Speak's core lessons have users repeat a target sentence and get immediate pronunciation feedback; harder levels blank out words or turn the exercise into Q&A. The speech systems behind those lessons have to be highly accurate, customized, and fine-tuned, and that takes consistent, high-quality ground-truth signal.
One of the primary things we care about is the quality of our labels, especially when it comes to streaming speech recognition. It's not just about perfect pronunciation; it's about whether the user reads the line and if their speech is understandable to an English speaker. This task involves subjectivity, so we focus on how consistently we hit that standard. Our goal is to generate ground-truth datasets where there's strong agreement on quality. At the end of the day, quality is king; it matters much more than quantity. Achieving this level of quality requires iteration and a dedicated team of the same people who can work consistently on refining it, which Labelbox helps us with.
— Andrew Hsu, Co-founder and CTO of Speak
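One way to picture "strong agreement on quality" is a consensus filter: several graders rate the same clip, and the clip only enters the ground-truth set when enough of them give the same grade. The sketch below is an illustration under assumptions, not Speak's or Labelbox's actual method. The function names, the "pass"/"fail" grades, and the 0.8 threshold are all hypothetical.

```python
from collections import Counter


def consensus_label(grades, min_agreement=0.8):
    """Return the majority grade if enough graders agree, else None.

    `min_agreement` is a hypothetical threshold: the share of graders
    who must give the same grade for the clip to count as ground truth.
    """
    if not grades:
        return None
    label, votes = Counter(grades).most_common(1)[0]
    return label if votes / len(grades) >= min_agreement else None


def build_ground_truth(clip_grades, min_agreement=0.8):
    """Keep only clips with strong grader agreement (clip id -> grade)."""
    kept = {}
    for clip_id, grades in clip_grades.items():
        label = consensus_label(grades, min_agreement)
        if label is not None:
            kept[clip_id] = label
    return kept


# Hypothetical grades: did the user read the line understandably?
clip_grades = {
    "clip-a": ["pass", "pass", "pass", "pass", "pass"],
    "clip-b": ["pass", "pass", "fail", "fail", "pass"],
    "clip-c": ["fail", "fail", "fail", "fail", "pass"],
}
ground_truth = build_ground_truth(clip_grades)
```

Clips that fall below the threshold, like `clip-b` here, are dropped rather than averaged in. That trades dataset size for consistency, in line with the quality-over-quantity emphasis in the quote above.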

The outcome
With Labelbox, Speak built an automatic labeling loop that evaluates its speech systems daily on production data and continuously retrains the models.
At the highest level, we now have a way to tackle something that used to be very painful: an automatic labeling loop that lets us evaluate our speech systems daily on production data. This loop allows us to continuously retrain our models, which has led to significant improvements. For example, we saw model accuracy improvements of 35% after using Labelbox, which effectively translated into a 2x increase in speed in terms of accelerating model development.
— Tobi Szuts, Machine Learning Engineer at Speak
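A daily evaluation step of this kind can be sketched as scoring each day's sample of production transcripts against expert-graded references and flagging a retrain when the score slips. The source does not say which metric Speak uses. The sketch below assumes word error rate (WER), a standard speech recognition metric, and the function names, sample sentences, and tolerance value are hypothetical.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def daily_wer(pairs):
    """Corpus-level WER over (reference, hypothesis) transcript pairs."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    return errors / ref_words if ref_words else 0.0


def needs_retraining(today_wer, baseline_wer, tolerance=0.02):
    """Flag a retrain when today's WER exceeds the baseline by more
    than `tolerance` (a hypothetical threshold)."""
    return today_wer > baseline_wer + tolerance


# Hypothetical daily sample: (expert reference, model transcript).
sample = [
    ("i would like a coffee", "i would like coffee"),
    ("where is the station", "where is the station"),
]
today = daily_wer(sample)  # one deleted word out of nine reference words
```

Running `needs_retraining(today, baseline_wer)` once a day against a stored baseline gives a simple trigger for the retraining step. A production loop would also need sampling, storage, and the training job itself, none of which are shown here.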

Increased efficiency: Speak's signal-production time was cut by nearly 50%, allowing it to release new features and languages at a faster pace.
Improved signal quality: real-time feedback loops and quality assurance kept accuracy high, improving the models' ability to recognize accents, tonal variations, and context in conversations.
Scalable operations: as Speak expanded to new regions, Labelbox's unified platform plus post-training services acted as a data engine to onboard a diverse set of contributors and manage a growing dataset without compromising quality.
Where this goes
Speak now has a foundation for data-quality assurance across English and Spanish, on GCP infrastructure with OpenAI models in the loop. The pattern is a learning loop: production data in, expert-graded signal out, models that get better every day. It serves the vision of delivering, through AI and at scale, the kind of private tutoring that could otherwise cost upwards of $100 per hour.