Benchmarks (Gold Standard)

Benchmarks (also known as Gold Standard) is a quality assurance tool for training data. Training data quality is a measure of the accuracy and consistency of the training data. Benchmarks works by interspersing data for which a benchmark label exists into each labeler's queue. The resulting labels are compared against their respective benchmarks, and an accuracy score between 0 and 100 percent is calculated.
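The scoring idea can be sketched as a simple percent-agreement calculation. This is an illustrative sketch only: the function name and the dict-based annotation format are assumptions, and the actual scoring accounts for details such as geometric overlap between annotations.

```python
# Hypothetical sketch of a benchmark accuracy score: the fraction of the
# benchmark's annotations that a labeler reproduced, scaled to 0-100.
# Annotations are modeled here as a dict mapping object id -> class name.

def benchmark_score(labeler_annotations, benchmark_annotations):
    """Return percent agreement between a labeler's annotations
    and the benchmark annotations (0-100)."""
    if not benchmark_annotations:
        return 100.0
    matches = sum(
        1
        for obj_id, cls in benchmark_annotations.items()
        if labeler_annotations.get(obj_id) == cls
    )
    return 100.0 * matches / len(benchmark_annotations)

benchmark = {"obj1": "car", "obj2": "tree", "obj3": "car"}
submitted = {"obj1": "car", "obj2": "bush", "obj3": "car"}
score = benchmark_score(submitted, benchmark)  # 2 of 3 annotations agree
```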

Benchmarks will be available for new Image Editor projects in early 2020. Until then, you can use the legacy Image Editor to access this feature.

Use Benchmarks to ensure the labeling team labels data accurately, both initially and throughout the lifecycle of the training data. Benchmarks are created and managed from the Labels > Benchmarks view. Benchmark results are shown in the Labels > Benchmarks view, as well as individually for each team member in the Performance view.

Setting up Benchmarks

  1. Start with an existing project or create a new project.
  2. Configure the project to use Benchmarks: go to Settings > Quality and select Benchmark.
  3. Create a benchmark from an existing label by clicking the star icon on the top bar.
  4. With a benchmark defined, the underlying source datum will be served to every labeler. Each label created for that datum is scored against the benchmark label. View the average score for each benchmark from Labels > Benchmarks.
  5. View the detailed results of a benchmark by clicking View Results.

How Benchmarks test data is distributed to labelers

When a team member starts labeling in a project for the first time, they are served five benchmark test data. After that, benchmark test data are served in a semi-random, spaced-apart manner.
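The distribution scheme above can be sketched roughly as follows. This is not the platform's actual algorithm; the function name, window size, and insertion strategy are all assumptions made for illustration.

```python
import itertools
import random

def queue_with_benchmarks(regular, benchmarks, initial=5, spacing=10, seed=0):
    """Illustrative sketch: serve `initial` benchmark test data first,
    then insert one benchmark at a random position within each
    subsequent window of `spacing` regular data."""
    rng = random.Random(seed)
    bench = itertools.cycle(benchmarks)  # reuse benchmarks as needed

    # First-time labelers see the initial run of benchmark test data.
    queue = [next(bench) for _ in range(initial)]

    window = []
    for datum in regular:
        window.append(datum)
        if len(window) == spacing:
            # Semi-random placement: one benchmark per window,
            # at an unpredictable offset.
            window.insert(rng.randrange(spacing), next(bench))
            queue.extend(window)
            window = []
    queue.extend(window)  # any leftover regular data
    return queue
```

Because each window contains exactly one benchmark at a random offset, benchmarks stay spaced apart while remaining hard for labelers to anticipate.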

Understanding labeler performance

Each team member’s work is quantified and shown on the Performance tab. To see the details for a particular team member, click on their row to expand it.

Systemic poor labeler performance is often indicative of unclear instructions, while poor performance on specific pieces of data often indicates edge cases. Use these scores to improve your onboarding and education processes.
