
Targeting model weaknesses to improve Deque's accessibility AI
Problem
It takes about 10 minutes to fully annotate a single webpage screen for accessibility compliance. With thousands of web and mobile datasets amassed, the Deque team needed to automate manual effort and focus its labeling on the signal that mattered.
Solution
Deque used Labelbox Model Diagnostics and Catalog to target its model's weaknesses and detect noise in its datasets.
Result
Deque rapidly filtered out about one-third of the data points it considered less trustworthy, improving model performance by 5% or more. It has also cut its annual labeling spend and needs by over 50%.

Deque builds AI for accessibility testing. Labelbox's Model Diagnostics and Catalog let it find model errors, prioritize the right signal, and cut labeling needs over 50%.
The challenge
Deque has spent decades democratizing digital accessibility — making documents, web, and mobile apps usable by everyone, including people with disabilities. Now it's using machine learning to lead the next generation of accessibility testing. Building that ML program is hard. It takes about 10 minutes to fully annotate a single webpage screen, and across thousands of web and mobile datasets that adds up. Before Labelbox, the team relied on disparate open-source annotation tools and hacked-together Jupyter Notebooks and Google Sheets — with no easy way to find model errors or prioritize what to label.
The approach
Deque built a data engine on Labelbox to prioritize the most performant classes of data, find model errors fast, and fuel iterations with high-quality signal. Model Diagnostics let the team evaluate and visualize model performance instead of computing metrics by hand.
“Before using Model Diagnostics in Labelbox to target the model’s weaknesses, we had to visualize the predictions on our own and everything was much more manual,” said Noé Barrell, ML Engineer at Deque. “We had to calculate all these metrics on our own, and it was a disjointed and difficult process. Being able to convert this workflow into something we could now do simply within Labelbox made diagnosing errors a much more streamlined experience. It’s become so much easier to iterate.”
“We detected some noise issues in our dataset and thanks to Model Diagnostics, we were able to filter out about one-third of data points we considered less trustworthy,” said Barrell. “By doing so, model performance went up 5%. We re-labeled some data and we saw the performance went up again after we added the re-labeled points. It was challenging data for humans to label and for the model to understand. Being able to target the data we already had in Labelbox and make changes and fixes to it was really helpful to us as a team to save time and target where we knew it would make a difference in our model’s performance.”
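The case study does not say how Deque scored a data point as "less trustworthy". As a rough illustration only, one common approach is to compare each human label with the model's prediction and route low-agreement items to review. The sketch below assumes bounding-box annotations, a hypothetical row format, and an arbitrary 0.5 agreement threshold; none of these come from Deque or Labelbox.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def split_by_agreement(rows, threshold=0.5):
    """Split row ids into (trusted, review) by label/prediction agreement.

    Hypothetical rule: a class mismatch scores 0, otherwise the score is
    the IoU between the labeled box and the predicted box.
    """
    trusted, review = [], []
    for row in rows:
        same_class = row["label_class"] == row["pred_class"]
        score = iou(row["label_box"], row["pred_box"]) if same_class else 0.0
        (trusted if score >= threshold else review).append(row["id"])
    return trusted, review


# Made-up rows for illustration; not Deque data.
rows = [
    {"id": "r1", "label_class": "checkbox", "pred_class": "checkbox",
     "label_box": (0, 0, 10, 10), "pred_box": (0, 0, 10, 10)},
    {"id": "r2", "label_class": "radio", "pred_class": "checkbox",
     "label_box": (0, 0, 10, 10), "pred_box": (0, 0, 10, 10)},
    {"id": "r3", "label_class": "checkbox", "pred_class": "checkbox",
     "label_box": (0, 0, 10, 10), "pred_box": (5, 5, 15, 15)},
]
trusted, review = split_by_agreement(rows)
print("trusted:", trusted)
print("review:", review)
```

Low agreement can mean the model is wrong rather than the label, so a sketch like this is better suited to building a review or re-labeling queue than to deleting data outright, which matches the re-labeling step Barrell describes.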
Catalog let the team search, discover, and prioritize the right data, using embeddings for unsupervised classification and easy batching.
“The Catalog feature in Labelbox is also huge for us. Pre-Catalog, for our data selection process, we’d look at the performance metrics of our model and let’s say, for example, we discovered it was indicating 50% accuracy on models, we would have to tediously and manually collect data surrounding that. But with the Catalog feature in Labelbox, we can target data collection for our models easily and quickly. Embeddings allow us to do unsupervised classification of models and select a lot of models. It’s just easier to create batches and sample around that. It takes a lot of the time and effort out of the data selection process,” said Barrell.
“Being able to reduce the data requirement is huge because you can see the same amount of improvement in the model’s performance in half the time and with half the effort. That was enabled through targeting the model’s weaknesses with Model Diagnostics and then being able to prioritize the right data through Catalog. Before, if we were doing a scattershot data collection, we would have been roughly labeling twice as much data and with twice as much effort,” said Barrell.
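The embedding-based selection Barrell describes can be pictured as a similarity search: start from examples the model gets wrong, then pull the unlabeled screens that sit closest to them in embedding space into the next labeling batch. The following is a minimal sketch of that idea in plain Python, not Labelbox's implementation or API. The function names, asset ids, and two-dimensional vectors are all hypothetical; real embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_batch(failure_embeddings, pool, batch_size):
    """Return the ids of the pool items most similar to any known failure.

    pool maps a (hypothetical) asset id to its embedding vector.
    """
    scored = []
    for asset_id, embedding in pool.items():
        best = max(cosine_similarity(embedding, f) for f in failure_embeddings)
        scored.append((best, asset_id))
    # Highest similarity first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [asset_id for _, asset_id in scored[:batch_size]]


# Made-up embeddings for illustration.
failures = [[1.0, 0.0], [0.9, 0.1]]
pool = {
    "screen_a": [0.95, 0.05],
    "screen_b": [0.0, 1.0],
    "screen_c": [0.7, 0.3],
    "screen_d": [-1.0, 0.0],
}
batch = select_batch(failures, pool, batch_size=2)
print(batch)
```

Sampling near known failures is what separates this from the "scattershot" collection in the quote: the labeling budget goes to screens that resemble the cases the model currently misses.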
The outcome
Targeting data collection at model failures, Deque made big accuracy jumps — improving detection of checkboxes from 47% to 75% accuracy, presentational tables from 66% to 79%, and radio buttons from 37.9% to 74%. It filtered out about one-third of less-trustworthy data points to lift model performance 5%, and re-labeling pushed it higher. Overall, Deque cut its annual labeling spend and needs by over 50%.
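To put those per-class numbers on a common footing, the short calculation below derives three views of each improvement: the gain in percentage points, the gain relative to the starting accuracy, and the share of the original errors that were eliminated. Only the before and after accuracies come from the case study; the derived metrics and their names are added here for illustration.

```python
# Before/after accuracies (percent) as reported in the case study.
results = {
    "checkboxes": (47.0, 75.0),
    "presentational tables": (66.0, 79.0),
    "radio buttons": (37.9, 74.0),
}


def gains(before, after):
    """Return (point gain, relative gain %, error reduction %), rounded to 0.1."""
    point_gain = after - before
    relative_gain = point_gain / before * 100
    error_reduction = point_gain / (100 - before) * 100
    return (round(point_gain, 1), round(relative_gain, 1), round(error_reduction, 1))


summary = {name: gains(before, after) for name, (before, after) in results.items()}
for name, (points, relative, errors_cut) in summary.items():
    print(f"{name}: +{points} points, +{relative}% relative, {errors_cut}% fewer errors")
```

Read this way, radio buttons show the largest jump (36.1 points, nearly doubling the starting accuracy), while checkboxes and radio buttons each shed more than half of their original errors.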
Where this goes
Accessibility features are for everyone — captions help in noisy rooms as much as for the hard of hearing. A model that knows its own weaknesses, fed exactly the signal it's missing, is how you make the web accessible at scale.