Auto Consensus enables you to quantitatively measure the quality of your training data. This is important because high-quality training data leads to performant AI.
How it works
Auto Consensus works by having more than one labeler (human or machine) label the same asset (image, text string, video, etc.). Once an asset has been labeled more than once, the results can be compared quantitatively and a consensus score is calculated automatically. Auto Consensus works in real time, so you can take immediate corrective action to improve your training data and model performance.
Every asset that has been labeled more than once within a Labelbox project has a consensus score. The consensus results are shown in the Consensus chart in the project overview.
Individual asset consensus
The consensus score of each asset is shown in the Activity table. To view all of the label results for a particular asset, click the stack icon to filter the Activity table on that asset. Clicking one of the table rows opens a review of all labeled instances of that asset.
Consensus score calculations
The consensus score calculation compares the annotations of a labeled asset against existing labeled instances of the same asset. The set of existing labeled instances is referred to as the field.
To demonstrate how the consensus score is calculated, let’s look at a basic example using a classification labeling task: Select the aircraft model.
The consensus score for this label is calculated against the field of existing labels. In this example, the image has been labeled by 3 labelers: one selected Airbus A350 and the other two selected Boeing 787.
For this classification task, the consensus score of each label is calculated by dividing the number of other labels in the field that agree with it by the total number of other labels. For each of the two Boeing 787 selections, the consensus score is 50% because 1 of the 2 other labels agrees. For the Airbus A350 selection, the consensus score is 0% because neither of the 2 other labels agrees.
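To make the arithmetic concrete, here is a minimal Python sketch of this per-label calculation; the function name and data layout are illustrative, not part of any Labelbox API:

```python
def label_consensus(label, other_labels):
    """Fraction of the other labels in the field that agree with this label."""
    agreeing = sum(other == label for other in other_labels)
    return agreeing / len(other_labels)

labels = ["Airbus A350", "Boeing 787", "Boeing 787"]

for i, label in enumerate(labels):
    field = labels[:i] + labels[i + 1:]  # every label except this one
    print(label, label_consensus(label, field))
# Airbus A350 0.0, Boeing 787 0.5, Boeing 787 0.5
```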
Classification consensus scoring
To calculate the consensus score for a multi-question classification label submission, we add up the consensus scores of the individual questions and divide by the number of questions. The consensus score for a question follows the same logic as the basic example above:

question score = (count of other labelers who gave the same answer to that question) / (total count of other labelers)
The consensus score for a multi-question classification label submission is therefore:

submission score = (sum of the per-question consensus scores) / (number of questions)
Using the equations above, we can calculate the consensus score for Arian's submission, and then repeat the same calculation for Julia and Taylor. A sketch of the full calculation, with hypothetical answers, follows.
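The sketch below is a minimal Python version of the multi-question calculation. The questions and the answers attributed to Arian, Julia, and Taylor are invented for illustration; they are not the values from the original example:

```python
def question_score(answer, other_answers):
    """Fraction of the other labelers who gave the same answer."""
    return sum(a == answer for a in other_answers) / len(other_answers)

def submission_score(labeler, submissions):
    """Average of the per-question consensus scores for one submission."""
    questions = submissions[labeler]
    others = [name for name in submissions if name != labeler]
    scores = [
        question_score(answer, [submissions[name][q] for name in others])
        for q, answer in questions.items()
    ]
    return sum(scores) / len(scores)

# Hypothetical answers to a two-question classification task.
submissions = {
    "Arian":  {"aircraft": "Boeing 787",  "livery": "ANA"},
    "Julia":  {"aircraft": "Boeing 787",  "livery": "ANA"},
    "Taylor": {"aircraft": "Airbus A350", "livery": "ANA"},
}

for name in submissions:
    print(name, submission_score(name, submissions))
# Arian 0.75, Julia 0.75, Taylor 0.5
```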
Configuring Auto Consensus
There are two main settings for Auto Consensus:
- The percentage of assets to be labeled more than once (referred to as the Consensus Subset)
- The number of labels each asset in the Consensus Subset will receive, each from a unique labeler (the Labels per asset setting)
A Simple Example
Gal is setting up a project in Labelbox and attaches 16,000 images. She is working with a team of 5 labelers. To ensure labeling quality, she wants 10% of the data rows to be labeled by 3 of the 5 labelers. She configures Auto Consensus accordingly by setting Percentage of assets to 10% and Labels per asset to 3. With this configuration, 1,600 of the images (10% of the 16,000 images) will each be labeled 3 times.
To complete the initial labeling for this project, 19,200 labels must be created in total: one label for each of the 14,400 images outside the Consensus Subset, plus 4,800 labels (1,600 × 3) for the images in the Consensus Subset. After one day, 3,000 labels were created, so 16,200 labels are still needed to finish the initial labeling task. The Labelbox project overview stats will say 3,000 Submitted, 16,200 Remaining, 0 Skipped, and 16% Complete.
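A few lines of Python reproduce this arithmetic; the variable names are illustrative:

```python
total_images = 16_000
consensus_percentage = 0.10  # "Percentage of assets" setting
labels_per_asset = 3         # "Labels per asset" setting

consensus_subset = int(total_images * consensus_percentage)   # 1,600 images
total_labels = (consensus_subset * labels_per_asset
                + (total_images - consensus_subset))          # 19,200 labels

submitted = 3_000                                             # after one day
remaining = total_labels - submitted                          # 16,200 labels
percent_complete = round(100 * submitted / total_labels)      # 16

print(total_labels, remaining, percent_complete)              # 19200 16200 16
```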