How to define your data labeling project's success criteria
Every ML project begins with the desired outcome of creating "high-quality" training data. But what does that really mean in practice, and how exactly do you know when it has been achieved?
Annotation requirements often evolve over the lifetime of an ML project, so it's necessary to define your success criteria at the outset and remain nimble enough to recognize when those criteria need to be adjusted.
Leveraging a workflow that allows for labeling to be done in small batches and that begins with a calibration phase can help clearly define "high quality" and ensure that quality is maintained.
It's crucial to take time at the beginning of a project to set expectations for labeling speed, define acceptable errors, and decide how quality will be measured, so that you can put an easy-to-follow quality SLA in place. This ensures that your labeling team fully understands the task at hand and the level of performance expected of them. While it may seem like a lot to think through at the outset of a labeling project, it creates a mutual understanding that leads to consistent, high-quality output.
Timeline and scope: Your deadline, data volume, and the average time per label
One of the first steps in getting a handle on how you define success is understanding the scope and timeline of your project. Some questions to consider as you plan for high-quality labels:
- Does this project have a definite volume with a deadline?
- Is there no specific deadline, but a fixed volume?
- Do you have a notion of how long you expect it to take to label each asset?
Regardless of whether there is a predetermined volume with a deadline, throughput should be part of your success criteria. Knowing your deadline and volume upfront lets you set the expectation that your labeling team will need to complete a given number of assets per day or week. If you have a fixed volume but no deadline, this is where a calibration phase comes in handy: an initial batch can be used to determine expected throughput. If you have a realistic expected time per asset, that can be used as well.
A project is successfully completed when it is finished in a timely manner or by a given deadline. Understanding labeling throughput and setting expectations up front gives you and your labeling team a shared understanding of success.
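To make the throughput arithmetic concrete, here is a minimal sketch in Python. All of the numbers (volume, deadline, team size, productive hours, minutes per asset) are hypothetical placeholders you would replace with your own figures, ideally refined during calibration.

```python
from datetime import date

# Hypothetical project numbers -- replace with your own volume, deadline, and team size.
total_assets = 20_000
deadline = date(2025, 9, 30)
labelers = 5
expected_minutes_per_asset = 3        # rough estimate, refined during calibration

days_left = (deadline - date.today()).days     # calendar days until the deadline
required_per_day = total_assets / days_left    # required team throughput
required_per_labeler = required_per_day / labelers

print(f"Team needs ~{required_per_day:.0f} assets/day "
      f"(~{required_per_labeler:.0f} per labeler per day).")

# Sanity check against the expected time per asset, assuming ~6 productive hours/day.
capacity_per_labeler = 6 * 60 / expected_minutes_per_asset
print(f"Estimated capacity: ~{capacity_per_labeler:.0f} assets per labeler per day.")
```

If the required rate exceeds the estimated capacity, that gap is exactly the kind of expectation mismatch the calibration phase is meant to surface.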
What is considered "good quality" and how is it measured?
It may be tempting to declare your project successful once labels have simply been deemed "good" or of "high quality". However, this needs to be defined further so that your labeling team knows what "good" looks like and how to achieve it.
If your labeling project requires a simple binary label, it is easy to understand what is good (correct) and what is bad (incorrect) labeling. As the complexity of your labeling task grows, so does the need to assign severity to errors and to state specific requirements for what "good" means.
As the owner of your project and labels, you'll need to define the grading criteria for your project. In essence, you have to decide how each asset will be evaluated as having passed or failed its initial labeling. For this, we suggest creating a definition of what causes an asset to be fully rejected.
For example, let's say you have a computer vision project that requires multiple classes of bounding boxes per image. If 1 of 7 bounding boxes is mislabeled on an image, is that error grave enough to reject the whole image? Or is it a lesser error where you would simply create an issue for correction, but not necessarily reject the entire image? Perhaps missing a bounding box entirely is a more serious error than a misclassification, and that becomes one of your criteria for rejection.
It is helpful to outline grave or fatal errors, as well as lesser ones, at the outset of a project or following your review during calibration. Defining these nuances will help your labeling team better understand your feedback and know where to focus their internal QA. We recommend making a list of these errors and keeping it handy while you review. Not only will this focus your own review, but you can copy and paste these named errors into whatever feedback format you use, making it easy to track trends.
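If you track review results programmatically, that error list can double as a small severity map. The sketch below is a hypothetical illustration in Python; the error names and severities are assumptions you would replace with your own definitions.

```python
from enum import Enum

class Severity(Enum):
    GRAVE = "grave"   # rejects the whole asset
    MINOR = "minor"   # logged as an issue for correction; asset still accepted

# Hypothetical error taxonomy for a bounding-box task -- use your own names.
ERROR_SEVERITY = {
    "missing_box": Severity.GRAVE,
    "wrong_class": Severity.MINOR,
    "loose_box_geometry": Severity.MINOR,
}

def asset_rejected(errors_found: list[str]) -> bool:
    """An asset is rejected if any of its errors is grave."""
    return any(ERROR_SEVERITY[e] is Severity.GRAVE for e in errors_found)

# A misclassification alone is a minor issue...
print(asset_rejected(["wrong_class"]))                 # False
# ...but a missed bounding box rejects the whole image.
print(asset_rejected(["wrong_class", "missing_box"]))  # True
```

Keeping the error names stable from batch to batch is what makes trend tracking possible later on.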
SLAs (quality, speed, throughput)
Quality SLAs (service-level agreements) with your labeling team are a bidirectional, mutually agreed-upon commitment, built on throughput and quality calculations.
If you want to implement a quality SLA on a project, we recommend starting with a calibration phase so you can get a real-world grasp of throughput (time per individual asset, or number of assets per week) and of the types of errors that surface when labeling is done by a team that is not yet intimately familiar with your project.
Determining and setting a percentage for quality is at your discretion, based on the type and number of errors you find during the first calibration phase. We suggest discussing this with the labeling team to ensure you're setting a threshold that is achievable given the complexity of your task.
When it comes to tracking whether an SLA is being met, throughput and quality are the two components you'll be looking at:
- Throughput can be gauged simply by checking that the agreed number of assets is in fact being completed on the agreed cadence (daily, weekly, etc.).
- As for ensuring that the necessary quality is being achieved, this requires your continued review of a sample throughout the lifespan of your project. This is where a reusable list of error types comes in handy: as you review, assets that receive grave errors are considered rejected. Calculating quality can be as simple as dividing the number of accepted assets by the total number reviewed (equivalently, tracking the rejection rate); see the sketch after this list. Having a clear record of how many assets in a reviewed sample were rejected, and why, will help your labeling team understand whether "good quality" is being achieved.
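To make that calculation concrete, here is a minimal sketch; the sample counts and the 95% threshold are made-up assumptions, not recommended values.

```python
# Hypothetical review of a weekly sample -- replace counts and threshold with your own.
reviewed = 200        # assets sampled and reviewed this week
rejected = 9          # assets with at least one grave error

quality = (reviewed - rejected) / reviewed   # share of reviewed assets accepted
sla_threshold = 0.95                         # mutually agreed quality target

print(f"Sample quality: {quality:.1%} (rejection rate {rejected / reviewed:.1%})")
print("SLA met" if quality >= sla_threshold else "SLA missed -- review error trends")
```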
Below you will find a table that outlines the milestones of implementing a quality SLA and assigns responsibility for each to a particular party. As stated, a well-executed quality SLA is bidirectional and requires the continued involvement of the project owner; ensuring good-quality labels is a shared responsibility.
After defining and setting expectations for how quality is measured, you'll want to spend some time developing a quality strategy. Your quality strategy describes how you will monitor and ensure that your project ends up with high-quality data. This can include automated quality-monitoring features, manual review and feedback, regular two-way communication with your labeling team, and more.
Learn more about how to create a quality strategy for your data labeling project in our next guide.