

Consensus is a Labelbox QA tool that compares a single Label on an asset to all of the other Labels on that asset. Once an asset has been labeled more than once, a Consensus score is automatically calculated. Consensus works in real time, so you can take immediate corrective action to improve your training data and model performance.

While you may collect Consensus votes on any data type, Consensus score calculations are not yet supported for the following:

  • Points
  • Polylines
  • Video labeling
  • Text labeling


Labelbox follows a similar methodology for calculating the agreement scores for both Benchmarks and Consensus. The only difference in the calculations is the entity to which the Labels are compared.

Whenever a Label is created, updated, or deleted, the Consensus score is recalculated as long as at least 2 Labels exist on that data row. Recalculation may take up to about 5 minutes, depending on the complexity of the Labels.

Bounding boxes, polygons, and masks

Generally speaking, calculating agreement for the polygons of a Label involves Intersection-over-Union and a series of averages to calculate the final agreement between two Labels on an image.
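As a rough illustration of the Intersection-over-Union step only, here is a minimal sketch for two axis-aligned bounding boxes. The `Box` type and `iou` function are hypothetical names for this example, not part of Labelbox. How Labelbox averages IoU values into the final score is not specified here, so that step is left out.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned bounding box in pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two boxes, in the range [0, 1]."""
    # Width and height of the overlapping region (0 if the boxes are disjoint).
    inter_w = max(0.0, min(a.right, b.right) - max(a.left, b.left))
    inter_h = max(0.0, min(a.bottom, b.bottom) - max(a.top, b.top))
    intersection = inter_w * inter_h

    area_a = (a.right - a.left) * (a.bottom - a.top)
    area_b = (b.right - b.left) * (b.bottom - b.top)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes.
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes that overlap on half their area.
print(iou(Box(0, 0, 2, 2), Box(1, 0, 3, 2)))
```

Polygons and masks follow the same idea, with the intersection and union computed over the shapes or pixels instead of rectangles.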


There are four global classification types: radio, checklist, text, and dropdown. The calculation method for each classification type is different. One commonality, however, is that if two classifications of the same type are compared and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

A Radio classification can only have one selected answer. Therefore, the agreement between two radio classifications will either be 0% or 100%. 0% means no agreement and 100% means agreement.

A Checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.
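The radio and checklist rules can be sketched as follows. This is an illustration under one stated assumption: "the number of selected answers" is read as the count of distinct answers selected across both classifications. The function names are hypothetical and the actual Labelbox implementation may differ.

```python
from __future__ import annotations


def radio_agreement(a: str, b: str) -> float:
    """Radio: one answer each, so agreement is all or nothing."""
    return 1.0 if a == b else 0.0


def checklist_agreement(a: set[str], b: set[str]) -> float:
    """Checklist: overlapping answers divided by selected answers.

    Assumption: "selected answers" means the distinct answers
    selected in either classification (the set union).
    """
    selected = a | b
    if not selected:
        # Nothing selected on either side; returning 0.0 is an
        # arbitrary choice for this sketch.
        return 0.0
    return len(a & b) / len(selected)


print(radio_agreement("cat", "dog"))
print(checklist_agreement({"red", "blue"}, {"blue", "green"}))
```

With no shared answers, the checklist function returns 0, which matches the rule that classifications with no corresponding selections have 0% agreement.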

A Dropdown classification can have only one selected answer; however, the answer choices can be nested. The calculation for dropdown is similar to that of the checklist classification, except that the agreement calculation divides the number of overlapping answers by the total depth of the selection (the number of levels). Answers nested under different top-level classifications can still have overlap if the classifications at the next level match. Conversely, answers that do not match exactly can still have overlap if they are under the same top-level classification.
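One way to picture the dropdown rule is to treat each selection as a path of answers from the top level down, then compare the paths level by level. The sketch below rests on two assumptions that the description does not settle: levels are compared position by position, and "total depth" is the depth of the deeper of the two selections. The function name is hypothetical.

```python
from __future__ import annotations


def dropdown_agreement(a: list[str], b: list[str]) -> float:
    """Agreement between two nested dropdown selections.

    Each selection is a path such as ["vehicle", "car"].
    Assumptions: levels are compared by position, and the divisor
    is the depth of the deeper selection.
    """
    depth = max(len(a), len(b))
    if depth == 0:
        return 0.0
    # Count the levels at which both selections chose the same answer.
    overlap = sum(1 for x, y in zip(a, b) if x == y)
    return overlap / depth


# Same top level, different second level: partial agreement.
print(dropdown_agreement(["vehicle", "car"], ["vehicle", "truck"]))
# Different top level, same second level: also partial agreement.
print(dropdown_agreement(["vehicle", "car"], ["machine", "car"]))
```

Both calls return 0.5 under these assumptions, which mirrors the two cases described above.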

Object + Child classification

When a Classification is nested under an Object annotation (e.g. Bounding box with a checklist classification), Consensus scores for Object and Classification annotations are computed separately and averaged together.
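A minimal sketch of that final step, assuming a plain unweighted mean of the two scores (the weighting is not stated here, and the function name is hypothetical):

```python
def combined_agreement(object_score: float, classification_score: float) -> float:
    """Average the Object score and its child Classification score.

    Assumption: both scores are fractions in [0, 1] and are
    weighted equally.
    """
    return (object_score + classification_score) / 2


# A perfectly matching bounding box whose checklist is half in agreement.
print(combined_agreement(1.0, 0.5))  # 0.75
```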


  1. Create a project or select an existing one.
  2. Navigate to Settings > Quality and select Consensus to turn this QA feature on for your project.
  3. Choose the “Coverage” percentage and the number of “Votes”. Coverage is the share of assets that are labeled more than once, and Votes is how many times each of those assets gets labeled. For example, with Coverage set to 25% and Votes set to 3, 25% of the assets get labeled 3 times.
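To get a feel for how these two settings affect labeling workload, the sketch below estimates the total number of Labels a project would produce. It assumes that assets outside the Coverage percentage are labeled exactly once and that the covered asset count is rounded to the nearest whole number. Neither detail is stated here, and the function name is hypothetical.

```python
def expected_label_count(total_assets: int, coverage: float, votes: int) -> int:
    """Estimate total Labels for a given Coverage (0-1) and Votes.

    Assumptions: assets outside the coverage set are labeled once,
    and the covered asset count is rounded to the nearest integer.
    """
    covered = round(total_assets * coverage)
    uncovered = total_assets - covered
    return covered * votes + uncovered


# 1,000 assets, 25% Coverage, 3 Votes:
# 250 assets x 3 Labels + 750 assets x 1 Label
print(expected_label_count(1000, 0.25, 3))  # 1500
```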

View results

The chart at the bottom of the Overview tab displays the Consensus scores across all labels in the project. The x-axis indicates the agreement percentage and the y-axis indicates the label count.

The Consensus column in the Activity table contains the agreement score for each label and how many labels are associated with that score. When you click the Consensus icon, the Activity table automatically applies a filter that shows the labels associated with that Consensus score.

When you click on an individual labeler in the Performance tab, the Consensus column reflects the average Consensus score for that labeler.
