Labelbox•October 8, 2021
The concept of data-centric machine learning, promoted widely by leaders in AI such as Andrew Ng in recent months, is quickly taking root in the enterprise machine learning community.
Enterprises are treating their training data as IP because it’s the only component of machine learning that requires creativity and domain expertise to collect, organize, and curate. In this post, we’ll dive into some of the specific challenges of data-centric machine learning and how your team can address them with active learning methods.
Taking a data-first approach to machine learning comes with its own specific challenges in data management, data analysis, and labeling.
Active learning in ML can help alleviate all three of these issues. Through automation, ML teams can improve data management, gain better insights into their data, and improve the rate of iteration.
One of the foundational steps for building an active learning pipeline is to bring together all the data relevant to your project, along with metadata such as embeddings, previous annotations, and model predictions. This enables teams to access their data quickly at any time and iterate faster.
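As a rough illustration of what bringing the data together can look like, here is a minimal sketch that joins assets with their embeddings, annotations, and predictions into one lookup table. The `AssetRecord` fields, the `build_catalog` helper, and the example URIs are all hypothetical names chosen for this example, not part of any particular product’s API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AssetRecord:
    """One row of a hypothetical project catalog (field names are illustrative)."""
    asset_id: str
    uri: str
    embedding: Optional[List[float]] = None  # feature vector from any encoder
    annotation: Optional[str] = None         # most recent human label, if any
    prediction: Optional[str] = None         # latest model prediction, if any
    confidence: Optional[float] = None       # model confidence for that prediction


def build_catalog(
    assets: Dict[str, str],
    embeddings: Dict[str, List[float]],
    annotations: Dict[str, str],
    predictions: Dict[str, Tuple[str, float]],
) -> Dict[str, AssetRecord]:
    """Join per-asset metadata from separate sources into one lookup table."""
    catalog = {}
    for asset_id, uri in assets.items():
        # Assets without a prediction yet get (None, None).
        label, confidence = predictions.get(asset_id, (None, None))
        catalog[asset_id] = AssetRecord(
            asset_id=asset_id,
            uri=uri,
            embedding=embeddings.get(asset_id),
            annotation=annotations.get(asset_id),
            prediction=label,
            confidence=confidence,
        )
    return catalog


# Made-up example data: two assets, only one of which is labeled so far.
catalog = build_catalog(
    assets={"a1": "s3://bucket/a1.jpg", "a2": "s3://bucket/a2.jpg"},
    embeddings={"a1": [0.1, 0.9]},
    annotations={"a1": "cat"},
    predictions={"a1": ("cat", 0.93), "a2": ("dog", 0.51)},
)
unlabeled = [r.asset_id for r in catalog.values() if r.annotation is None]
```

With everything keyed by asset ID, questions such as "which assets are still unlabeled?" or "which predictions disagree with their annotation?" become simple filters over one table.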
Rather than sending all the data off to be labeled at once, which is usually expensive and time-consuming, teams should instead pull together a small, curated dataset to build a baseline version of their model. Choose assets that will help the model find a general understanding of the task at hand, and ensure that all major classes of interest are represented.
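One simple way to make sure every major class shows up in that first dataset is to cap the number of assets taken per class. The sketch below assumes each asset already carries a rough class hint (from folder names, tags, or a quick manual pass); `select_seed_set` and `class_hints` are hypothetical names used only for illustration.

```python
import random
from collections import defaultdict


def select_seed_set(class_hints, per_class, seed=0):
    """Pick up to `per_class` assets for every class so none is left out.

    class_hints maps asset_id -> a rough class guess (a hypothetical input).
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_class = defaultdict(list)
    for asset_id, hint in class_hints.items():
        by_class[hint].append(asset_id)

    selected = []
    for hint in sorted(by_class):
        pool = sorted(by_class[hint])
        rng.shuffle(pool)
        # Rare classes contribute everything they have; common ones are capped.
        selected.extend(pool[:per_class])
    return selected


# Made-up, heavily imbalanced pool: 90 cars, 8 trucks, 2 buses.
hints = {f"car_{i}": "car" for i in range(90)}
hints.update({f"truck_{i}": "truck" for i in range(8)})
hints.update({f"bus_{i}": "bus" for i in range(2)})

seed_set = select_seed_set(hints, per_class=5)
covered = {hints[asset_id] for asset_id in seed_set}
```

A purely random sample of 12 assets from this pool would most likely miss the bus class entirely, while the capped selection keeps both buses.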
If it’s available and applicable, teams can pull in an off-the-shelf model to establish a model-assisted labeling workflow. Pre-labeling images can save labelers a significant amount of time and bring the team to their first iteration much faster.
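A model-assisted labeling step can be sketched as follows. This is only an outline under stated assumptions: `predict` stands in for whatever off-the-shelf model is available, and the `min_confidence` cutoff is our own illustrative choice rather than a recommended value.

```python
def prelabel(asset_ids, predict, min_confidence=0.8):
    """Split assets into pre-labeled drafts and ones to label from scratch.

    predict: callable mapping asset_id -> (label, confidence). It is a
    stand-in for any off-the-shelf model (hypothetical interface).
    """
    drafts, from_scratch = {}, []
    for asset_id in asset_ids:
        label, confidence = predict(asset_id)
        if confidence >= min_confidence:
            drafts[asset_id] = label       # labelers review and correct this
        else:
            from_scratch.append(asset_id)  # too uncertain to be a useful draft
    return drafts, from_scratch


# A dictionary plays the role of the model in this toy example.
fake_model = {"a1": ("cat", 0.95), "a2": ("dog", 0.55), "a3": ("cat", 0.81)}
drafts, from_scratch = prelabel(list(fake_model), fake_model.get)
```

The drafts are starting points for human review, not final labels; the time savings come from correcting a mostly right label instead of creating one from nothing.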
ML teams often fail to measure their model’s performance early and often. The baseline model for active learning is a perfect point to pause and understand the nuances of its performance. Teams can then use that information to communicate with stakeholders and pivot their strategy if necessary. They may also choose to train a couple of different variations here to find the best possible baseline version.
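To make that measurement concrete, a per-class breakdown is often more informative than a single accuracy number. The following is a minimal, dependency-free sketch; `per_class_report` is a hypothetical helper and the labels are made-up example data.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Compute precision, recall, and support for every class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gained a false positive
            fn[truth] += 1  # true class gained a false negative

    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cls] + fp[cls]
        actual = tp[cls] + fn[cls]
        report[cls] = {
            "precision": tp[cls] / predicted if predicted else 0.0,
            "recall": tp[cls] / actual if actual else 0.0,
            "support": actual,
        }
    return report


y_true = ["cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]
report = per_class_report(y_true, y_pred)
for cls, stats in report.items():
    print(cls, stats)
```

In this toy example the overall accuracy is 60%, yet the report shows that the model never recovers the bird class at all, which is exactly the kind of nuance worth surfacing to stakeholders early.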
Because ML teams have full control over the data they use for training their model, they should focus at this early stage on powering the iterative cycle rather than on optimizing the model. The baseline model for active learning will also help to identify errors, either with the data or the model itself, which can be leveraged for further iterations.
Teams can now establish a comprehensive iterative process, involving a detailed diagnosis of the model’s performance, a new labeled dataset informed by model performance as well as any adjustments from stakeholders, and automated workflows like model-assisted labeling that increase label quality and the speed of the iterative cycle.
Four types of model errors commonly occur, and each can be addressed as soon as it emerges when following this process:
With insights from model error analysis and their baseline model, teams can once again deliberately select assets from their larger pool rather than randomly sampling data. To learn how you can use Labelbox to diagnose your model’s errors and curate your next labeled dataset, watch our webcast.
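One common way to select assets deliberately instead of at random is uncertainty sampling: rank the unlabeled pool by how unsure the baseline model is and label the most uncertain assets first. The sketch below uses prediction entropy as the uncertainty score. It is one standard active learning heuristic that we picked for illustration, not a description of how any specific tool chooses data, and `select_most_uncertain` and the probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, budget):
    """Return the `budget` asset IDs whose predictions have the highest entropy.

    pool_probs maps asset_id -> predicted class probabilities from the
    baseline model (hypothetical input).
    """
    ranked = sorted(pool_probs, key=lambda a: entropy(pool_probs[a]), reverse=True)
    return ranked[:budget]


# Made-up predictions over three classes for four unlabeled assets.
pool = {
    "a1": [0.98, 0.01, 0.01],  # model is confident
    "a2": [0.40, 0.35, 0.25],
    "a3": [0.60, 0.30, 0.10],
    "a4": [0.34, 0.33, 0.33],  # model is nearly guessing
}
next_batch = select_most_uncertain(pool, budget=2)
```

Here the two assets the model is least sure about, `a4` and `a2`, are sent for labeling first. In practice teams often combine a score like this with the findings from error analysis, for example by restricting the pool to classes where the model performs poorly.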