
How Ancestry trains models to read historical records faster
Problem
Ancestry's data science team wanted more efficient ways to decode census data and train its ML models faster — shifting from solely building models to a data-centric approach and optimizing its whole MLOps pipeline. The bottleneck was producing high-quality signal and looping in domain experts who understand historical documents.
Solution
Ancestry adopted Labelbox, using model-assisted labeling, annotation relationships, and labeling services in the native image and text editors to produce signal faster — and to bring domain experts into the labeling and review loop.
Result
Using Labelbox as a data engine, Ancestry can better contextualize its historical data, maintain signal quality, save time collaborating with domain experts, and train and test new models in record time.

Ancestry uses neural networks and transformers to extract genealogical data from historical artifacts. Labelbox's platform produces the expert-graded signal and the collaboration loop that got the team to weekly model iteration.
Note: This post is a shortened recap of a virtual talk given by Stanley Fujimoto, Data Scientist at Ancestry, at Labelbox Accelerate (Nov 2022).
The challenge
Ancestry combines billions of historical records, millions of family trees, and AncestryDNA samples. Its data science team uses neural networks and transformers to extract genealogical data from historical artifacts such as census records, and it wanted to train those models faster. The team wanted to shift from solely building models to a data-centric approach and optimize its whole MLOps pipeline. The bottleneck was signal: data scientists owned labeling end to end, and when they didn't write the specs themselves, domain experts with deep knowledge of historical documents weren't easily looped into labeling and review. As a result, getting data labeled and reviewed was the slowest part of each iteration.
The approach
Ancestry adopted Labelbox. Working in the native image and text editors, the team used model-assisted labeling, annotation relationships, and Labelbox's labeling services to produce signal faster — and brought its domain experts into the labeling and review loop.
“Before Labelbox, we could train a model pretty quickly and evaluate against validation test sets, but getting data labeled and reviewed took forever. Having a strong collaborative annotation platform helped us get to a weekly iteration cycle,” said Stanley Fujimoto, Data Scientist at Ancestry.
Analytics and in-depth metrics showed how signal was being produced, so the team could communicate in real time and correct labels as needed, even when annotating images and text in PDF documents at scale.
“Our team is able to collaborate more efficiently by dropping a pin on any image [asset], where we’ll have a question and we can respond to our labelers in line. For other platforms, the labeling process is a complete black box as we have to wait for all of our labels to come back before we can review and give any feedback. We've found that writing labeling specs always includes some level of ambiguity. For that reason, clarification and iteration speed is essential.”
— Stanley Fujimoto, Data Scientist at Ancestry
The outcome
Toward its goal of weekly model releases, Ancestry uses Labelbox as a data engine to train models faster and evaluate validation sets more easily. For data that has never been labeled, a QA process in the platform speeds up extracting the precise location of historical records and analysis like handwriting recognition. The team can contextualize its data, maintain quality, save time, and train and test new models in record time.
Where this goes
Reading the world's historical record is a signal problem. Expert judgment, captured as structured signal, is what lets a model learn to decode it.