Data Labeling 101

What is data labeling?

Data labeling is the task of detecting and tagging data with labels, most commonly for image, video, audio, and text assets. The process typically relies on human-powered work to manually curate the data, in some cases with computer-assisted help. The set of labels is predetermined by a machine learning engineer and chosen to give a machine learning model information about what each asset contains, so that the model can learn from these examples. Data labeling also helps machine learning engineers home in on the factors that determine the overall precision and accuracy of their model, such as possible naming and categorization issues, how to represent occluded objects, and how to deal with parts of an image that are unrecognizable.

How do you label data and why is it important?

Data labeling is a central part of the data preprocessing workflow for machine learning. Data labeling structures data to make it meaningful. This labeled data is then used to train a machine learning model to find “meaning” in new, relevantly similar data. Throughout this process, machine learning practitioners strive for both quality and quantity: more accurately labeled data, in larger quantities, produces more useful deep learning models, because the resulting model bases its decisions on all of the labeled data.

To illustrate with the example below, a person applies a series of labels to an image asset by drawing bounding boxes around the relevant objects, a process known as image labeling. In this case, pedestrians are marked in blue, while taxis and trucks are marked in yellow. Accurately distinguishing the vehicles from the pedestrians will yield a more successful model, meaning one that can make accurate predictions when presented with new data (in this case, images of objects in a street view).

IA-overview

This process is then repeated, and depending on the business use case and project, the quantity of labels on each image can vary. Some projects require only one label to represent the content of an entire image (e.g., image classification). Other projects require multiple objects to be tagged within a single image, each with a different label (e.g., bounding boxes).
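
For illustration, these two label structures might be exported as records like the following. This is a hypothetical sketch; field names and coordinate conventions vary by platform and export format.

```python
# Hypothetical examples of the two label structures described above.
# Field names are illustrative, not any specific platform's export format.

# Image classification: a single label describes the entire image.
classification_label = {
    "image": "street_view_001.jpg",
    "label": "urban_street",
}

# Object detection: multiple labeled bounding boxes per image, each box
# given as pixel coordinates (top-left corner, width, height).
bounding_box_labels = {
    "image": "street_view_001.jpg",
    "objects": [
        {"label": "pedestrian", "bbox": {"x": 104, "y": 212, "w": 38, "h": 90}},
        {"label": "taxi", "bbox": {"x": 320, "y": 240, "w": 150, "h": 85}},
    ],
}
```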

How does a training data platform support data labeling?

Data labeling projects begin by identifying and instructing human labelers (otherwise known as annotators) to perform labeling tasks. Annotators must be thoroughly trained on the specifications and guidelines of each annotation project, as every company has different requirements.

In the specific case of images and videos, once the annotators are trained on how to annotate or label the data, they begin labeling hundreds or thousands of images or videos on a training data platform. A training data platform is software designed to provide all the necessary tools for the desired type of labeling, commonly including tools that allow you to outline complex shapes at the pixel level.

In addition, training data platforms typically offer features that help optimize your data labeling projects, including:

  • High-performance labeling tools:

    An important point to consider and test is whether the tools provided by the training data platform support a high number of objects and labels per asset without sacrificing loading times. At Labelbox, our vector pen tool allows you to draw freehand as well as in straight lines. Fast, ergonomic drawing tools reduce the time it takes to produce consistently pixel-perfect labels.

IS-pen-tool

Labelbox pen tool illustrated for accelerated labeling

  • Customization based on ontology requirements:

    The ability to configure the training data platform to your exact data structure (ontology) requirements ensures labeling consistency and scalability as your use cases expand. Labelbox provides a convenient way to copy your ontology across multiple projects, so you can make cascading changes or use an existing ontology as a starting point rather than building from scratch. A rough sketch of what an ontology declares appears after the figure below.

ontology-pic

Configure the data labeling editor to your exact data structure (ontology) requirements.
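
As a rough sketch, an ontology can be thought of as a declaration of the tools (object types) and classifications (questions) labelers may apply to each asset. The structure below is illustrative pseudodata, not Labelbox's actual ontology schema.

```python
# A hypothetical ontology for the street-scene example: which objects can be
# drawn, with which tool, plus image-level questions labelers must answer.
street_scene_ontology = {
    "tools": [
        {"type": "bounding_box", "name": "pedestrian"},
        {"type": "bounding_box", "name": "taxi"},
        {"type": "polygon", "name": "road_surface"},
    ],
    "classifications": [
        {
            "type": "radio",
            "name": "time_of_day",
            "options": ["day", "night", "dusk"],
        },
    ],
}
```

Keeping this definition in one place and copying it across projects is what enables the cascading changes described above: every project labels against the same named classes.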

  • A streamlined user interface which emphasizes performance for a wide array of devices:

    An intuitive design lowers the cognitive load on labelers, which enables fast image labeling. Performance is critical for professional annotators who work in an editor all day, even on lower-spec PCs and laptops.

UI-overview-pic

A simple, intuitive UI reduces friction when labeling many images or videos

  • Seamlessly connect your data via Python SDK or API for easy labeling:

    Stream data into your training data platform and push labeled data into training environments like TensorFlow and PyTorch. Labelbox was built to be developer-friendly and API-first, so you can use it as infrastructure to scale up and connect your ML models, accelerating data labeling productivity and orchestrating active learning. A minimal sketch of this workflow appears after the figure below.

sdk-api

Simplified data import without writing and maintaining your own scripts
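
As a minimal sketch, streaming images into the platform via the Python SDK might look like the following. Method names follow the v3-era Labelbox SDK and exact signatures vary by version, so treat this as illustrative rather than copy-paste ready; the API key and URLs are placeholders.

```python
# Minimal sketch: create a dataset and attach image URLs as data rows.
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")  # placeholder key

# Create a dataset and register publicly accessible image URLs with it.
dataset = client.create_dataset(name="street-scenes")
image_urls = [
    "https://example.com/images/street_001.jpg",  # hypothetical URLs
    "https://example.com/images/street_002.jpg",
]
for url in image_urls:
    dataset.create_data_row(row_data=url)
```

The same SDK connection can later be used in the other direction, exporting finished labels for a training job, which is what makes the platform usable as infrastructure rather than a standalone tool.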

  • Benchmarks & Consensus:

    Quality is measured by both the consistency and the accuracy of labeled data. The industry-standard methods for calculating training data quality are benchmarks (also known as gold standards), consensus, and review. An essential part of a data scientist's job in AI is figuring out which combination of these quality assurance procedures is right for a given ML project. Quality assurance is an automated process that runs continuously throughout your training data development and improvement cycles. With Labelbox's consensus and benchmark features, you can automate consistency and accuracy tests, customizing the percentage of your data to test and the number of labelers who will annotate the test data. A sketch of one common consensus metric appears after the figure below.

Benchmarks overview

Benchmarks in action, highlighting the example labeled asset with a gold star
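
One common way to score agreement on bounding boxes, whether between two labelers (consensus) or between a labeler and a gold-standard benchmark, is intersection over union (IoU): 1.0 means identical boxes, 0.0 means no overlap. The sketch below is a generic illustration of that metric, not Labelbox's exact scoring formula.

```python
# Generic IoU between two boxes given as (x, y, width, height) in pixels.
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# Two labelers annotate the same taxi; their boxes mostly agree.
labeler_1 = (320, 240, 150, 85)
labeler_2 = (325, 238, 148, 90)
print(f"agreement: {iou(labeler_1, labeler_2):.2f}")  # ≈ 0.90
```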

  • Collaboration and Performance Monitoring:

    Having an organized system to invite and supervise all your labelers during an image annotation project is important for both scalability and security. A training data platform should include granular options for inviting users and reviewing each annotator's work progress. With Labelbox, setting up a project and inviting new members is straightforward, and there are many options for monitoring labeler performance, including statistics on the seconds needed to label an image. You can also implement several quality control mechanisms, such as automatic consensus between different labelers or gold standard benchmarks.

Collaboration-overview

Seamless collaboration between data science teams, domain experts, and dedicated external labeling teams