How Ancestry prioritizes collaboration and training data quality to enable genealogical breakthroughs with ML

Ancestry uses neural networks and transformers to extract genealogical data from historical artifacts. Labelbox's platform produces the expert-graded signal and the collaboration loop that got the team to weekly model iteration.

Note: This post is a shortened recap of a virtual talk given by Stanley Fujimoto, Data Scientist at Ancestry, at Labelbox Accelerate (Nov 2022).

The challenge

Ancestry combines billions of historical records, millions of family trees, and AncestryDNA samples. Its data science team uses neural networks and transformers to extract genealogical data from historical artifacts, for example by decoding census data. The team wanted to train models faster, shift its focus from model building to a data-centric approach, and optimize its whole MLOps pipeline. The bottleneck was signal: data scientists owned labeling end to end, and domain experts with deep knowledge of historical documents weren't easily looped into labeling and review unless they wrote the specs themselves. As a result, labeling and review, not model training, set the pace of iteration.

The approach

Ancestry adopted Labelbox. Working in the native image and text editors, the team used model-assisted labeling, annotation relationships, and Labelbox's labeling services to produce signal faster — and brought its domain experts into the labeling and review loop.

“Before Labelbox, we could train a model pretty quickly and evaluate against validation test sets, but getting data labeled and reviewed took forever. Having a strong collaborative annotation platform helped us get to a weekly iteration cycle,” said Stanley Fujimoto, Data Scientist at Ancestry.

Analytics and in-depth metrics showed how signal was being produced, so the team could communicate in real time and correct labels as needed. The team also annotated images and text in PDF documents at scale.

“Our team is able to collaborate more efficiently by dropping a pin on any image [asset], where we’ll have a question and we can respond to our labelers in line. For other platforms, the labeling process is a complete black box as we have to wait for all of our labels to come back before we can review and give any feedback. We've found that writing labeling specs always includes some level of ambiguity. For that reason, clarification and iteration speed is essential.”
— Stanley Fujimoto, Data Scientist at Ancestry

The outcome

Toward its goal of weekly model releases, Ancestry uses Labelbox as a data engine to train models faster and evaluate them against validation sets more easily. For data that has never been labeled, a QA process in the platform speeds up tasks such as extracting the precise location of historical records and analysis such as handwriting recognition. The team can contextualize its data, maintain quality, save time, and train and test new models on a weekly cycle.

Where this goes

Reading the world's historical record is a signal problem. Expert judgment, captured as structured signal, is what lets a model learn to decode it.
