How to define a task for your data labeling project

Large volumes of high-quality training data are crucial to the success of any machine learning model. A labeling project is where you orchestrate and manage all of your labeling operations within Labelbox.

The first step in the labeling process is to align on the key components of the labeling task within a project. This sets the tone of the project and allows Labelbox to make labeling more efficient down the line.


Defining the data

Data type

Creating a project and configuring your labeling task on the Labelbox platform begins with aligning on which type of data needs to be labeled.

Labelbox supports the following data types:

  • Images
  • Videos
  • Text
  • Conversational text
  • PDF documents
  • Geospatial / Tiled imagery
  • Audio
  • HTML
  • DICOM (medical imagery)

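If you manage projects programmatically, the data type is also fixed when the project is created. The snippet below is a minimal sketch using the Labelbox Python SDK; the API key, project name, and chosen media type are placeholder values.

```python
import labelbox as lb

# Authenticate with your Labelbox API key (placeholder value).
client = lb.Client(api_key="YOUR_API_KEY")

# Create a project whose media type matches the data you plan to label.
# Other values include lb.MediaType.Video, lb.MediaType.Text,
# lb.MediaType.Document, lb.MediaType.Audio, and more.
project = client.create_project(
    name="street-scene-detection",
    media_type=lb.MediaType.Image,
)
print(project.uid)
```
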
Based on your chosen data modality, the Labeling Team Manager can draw on the prior knowledge and experience of different labeling teams when assigning work to your projects.

For instance, some workforce teams have an excellent track record of labeling projects with a specific type of imagery, while others have extensive experience with the video editor or with specialized text projects. Curating labeling teams that are already well-versed in your use case and match your project needs lets them get started faster, with little friction.

Industry verticals

Similarly, our labeling partners are experienced in various industries such as:

  • Manufacturing
  • Real estate
  • Food service
  • Agrotech
  • Healthcare
  • Insurance
  • Retail
  • and many more

Whether to use labelers with experience in your industry of interest is worth discussing, as it allows the Labeling Team Manager to allocate the workforce that best meets your needs.

Some teams have extensive experience in aerial roof tagging for insurance companies, while others have worked on long-term microscopy projects in the medical field, among many other variations. Understanding the scope and framing of the task allows Labelbox to set your team up for success with the right workforce.

A workforce team that is well-versed in your industry or use case will need less time to get calibrated on your task. This means they'll be able to label more data in less time, which lowers labeling costs, since you only pay for time spent actively labeling.

There might be some use cases where general experience in a specific industry or data type is not sufficient to meet your requirements. Labelbox also offers the option to onboard expert labelers for your project needs. You can learn more below under the "Specific labeling requirements" section.


Leveraging Labelbox's suite of tools

Annotate

Labelbox's Annotate is designed to give you complete visibility and control over every aspect of your labeling operations across data modalities.

While setting up your labeling project, you'll need to review the supported file formats and annotation types in order to prevent issues down the line.

Similarly, it is important to understand how to use Annotate to set up your project and labeling task, how to collaborate with your internal or external teams, and how to ensure that you're minimizing labeling time and spend.

For instance, only one labeler can work on a given data row. If you have long videos to annotate, we might recommend splitting them into multiple files so more annotators can work on the data in parallel. Ultimately this will depend on your own speed and time requirements; however, the Labeling Team Manager is available to work with you to determine what will work best for your team's use case.

You can learn more about Annotate in our documentation.

Catalog

Labelbox's Catalog is a data curation tool for you to organize, search, visualize, and explore your unstructured data.

Using Catalog for data selection gives you a significant advantage: you can assemble a quality batch of data to label according to the specific parameters your task requires. You can leverage Catalog's features, such as filters, one-click similarity search, metadata, and more, to ensure that the data you're queueing to your project is well-suited to your business requirements.
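
As a rough sketch of how curated data ends up in a project, the snippet below uses the Labelbox Python SDK to create data rows with global keys (so they are searchable and filterable in Catalog) and then queues them to a project as a batch. The dataset name, URLs, global keys, and project ID are placeholders, and the exact shape of the task result can vary by SDK version.

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_ID")  # existing project

# Create a dataset and upload data rows with global keys so they can be
# searched and filtered in Catalog later.
dataset = client.create_dataset(name="street-scenes-2024")
task = dataset.create_data_rows([
    {"row_data": "https://example.com/image_001.jpg", "global_key": "scene-001"},
    {"row_data": "https://example.com/image_002.jpg", "global_key": "scene-002"},
])
task.wait_till_done()

# Queue the curated data rows to the project as a batch.
data_row_ids = [row["id"] for row in task.result]
project.create_batch("curated-batch-1", data_rows=data_row_ids)
```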

You can learn more about Catalog in our documentation.


Specific labeling requirements

Expert workforce

For more specific and specialized tasks, Labelbox has the ability to onboard labelers who are qualified in particular domains:

  • Technical labelers with specific skills, such as people with software programming certifications
  • Medical labelers like nurses, clinicians, dermatologists, neurologists, and surgeons
  • Labelers fluent in one of 20+ languages covered by our partners

Your Labeling Team Manager can source a specialist with the required expertise to get started on your task. Since experts can take longer to source, it is essential to identify and communicate this requirement, along with any additional information needed, before labeling starts.

Compliance

As Labelbox partners are spread across different countries, it's important to discuss any geographical requirements up front so your business requirements are met. Depending on your compliance and project needs, the Labeling Team Manager will ensure that the right workforce is onboarded accordingly.

Labelbox partners comply with the following standards and regulations:

  • SOC2 Type I
  • SOC2 Type II
  • GDPR
  • HIPAA

Labeling forecast

Volume

Defining your data volume is a key element to consider when outlining your labeling task. It sets early expectations around throughput and helps the Labeling Team Manager and the workforce organize the labeling in the most efficient manner.

For instance, the following aspects should be considered and discussed when defining your task:

  • Large volumes of data
  • Long-term projects
  • Short projects
  • Turnaround times
  • Upload frequency

Based on the above, the Labeling Team Manager can allocate the task to the team most appropriate to meet the volume demands in terms of resources and availability.

Timeline

Understanding the timeline of your task is also crucial to effectively ramping up and scaling labeling activity. A rough outline of when you want the project to start and the target completion date helps define a structured labeling process and makes resource management easier.

A task that is set up for success will consider the following:

  • How many data rows do you plan to upload to your project? At which frequency?
  • What are the expectations in terms of speed?
  • Do you have a deadline?

Quick turnarounds on high volumes of data and tight deadlines can be delicate to navigate, so planning ahead and understanding timelines in advance helps maximize the available resources and ensures the workforce's time is used efficiently.
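
As a back-of-the-envelope illustration (all numbers below are hypothetical), a quick calculation can help sanity-check whether a deadline is realistic given your volume and an assumed average time per label:

```python
# Hypothetical forecast: how many labeler-days does this volume require?
data_rows = 50_000            # total assets to label
avg_seconds_per_label = 45    # assumed average handling time per data row
labelers = 10                 # labelers allocated to the project
productive_hours_per_day = 6  # effective labeling hours per labeler per day

total_hours = data_rows * avg_seconds_per_label / 3600
days_needed = total_hours / (labelers * productive_hours_per_day)

print(f"~{total_hours:.0f} labeling hours, ~{days_needed:.1f} working days "
      f"with {labelers} labelers")
# With these assumptions: ~625 hours, ~10.4 working days.
```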


Creating an ontology

Another important aspect of a well-defined labeling task is the ontology. It needs to be built in a way that will work for the task at hand, and that follows the most logical workflow for a labeler. Ontologies and features should be created and managed with the goals of proper labeling, efficiency, and reusability in mind.

Components

Within an ontology, the three kinds of features are:

  • Objects (bounding boxes, polygons, segmentation masks, points, polylines, etc.)
  • Classifications (radio, checklist, etc.), which can be global or nested
  • Relationships, which are approached differently in Labelbox depending on the data type: with text data, you define relationships between entity annotations as part of the objects; with image data, you set relationship items in the ontology (see the sketch after this list)

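For teams that prefer to define the ontology in code, the sketch below shows one way to combine objects and classifications using the Labelbox Python SDK's OntologyBuilder. The feature names and options are placeholders, and `client` and `project` are assumed to already exist (see the earlier snippets).

```python
import labelbox as lb

# Objects: a bounding box tool and a polygon tool (placeholder names)
tools = [
    lb.Tool(tool=lb.Tool.Type.BBOX, name="vehicle"),
    lb.Tool(tool=lb.Tool.Type.POLYGON, name="building"),
]

# A global radio classification that applies to the whole asset
classifications = [
    lb.Classification(
        class_type=lb.Classification.Type.RADIO,
        name="weather",
        options=[lb.Option(value="sunny"), lb.Option(value="cloudy")],
    )
]

ontology_builder = lb.OntologyBuilder(tools=tools, classifications=classifications)
ontology = client.create_ontology(
    "street-scene-ontology",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Image,
)

# Attach the ontology to the project (older SDK versions use project.setup_editor).
project.connect_ontology(ontology)
```
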
A good ontology should define and answer the following:

  • What should the labeling team be labeling?
  • How should objects and/or classifications be labeled?
  • What additional information is helpful for your model?

Reusing ontologies can be useful if you're planning multiple projects for the same or a very similar use case. Elements can be added to an existing ontology without affecting existing labels, so an ontology is not set in stone, and you're encouraged to test and fine-tune it.

You can learn more about how to create and manage your ontologies in this guide.

Speed

You should choose tools that will allow labelers to label as fast as possible while maintaining the output needed for your model.

Sample questions to consider when selecting tools would be:

  • Would separate bounding boxes per class be better than one bounding box with a nested classification? (See the sketch after this list.)
  • Is a segmentation mask necessary or would a polygon do?
  • Do you need every frame to be annotated?
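
To make the first question above concrete, the sketch below contrasts two hypothetical ontology fragments built with the Labelbox Python SDK: separate bounding box tools per class versus a single tool with a nested radio classification. Which is faster depends on your class count and how often labelers switch between classes.

```python
import labelbox as lb

# Option A: one bounding box tool per class
tools_per_class = [
    lb.Tool(tool=lb.Tool.Type.BBOX, name="sedan"),
    lb.Tool(tool=lb.Tool.Type.BBOX, name="truck"),
    lb.Tool(tool=lb.Tool.Type.BBOX, name="bus"),
]

# Option B: a single bounding box tool with a nested radio classification
tools_nested = [
    lb.Tool(
        tool=lb.Tool.Type.BBOX,
        name="vehicle",
        classifications=[
            lb.Classification(
                class_type=lb.Classification.Type.RADIO,
                name="vehicle_type",
                options=[
                    lb.Option(value="sedan"),
                    lb.Option(value="truck"),
                    lb.Option(value="bus"),
                ],
            )
        ],
    )
]
```

Option A keeps each class one click away in the tool palette, while Option B keeps the ontology compact but adds a nested selection step to every annotation.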

Quality

High-quality training data is critical to the success of your model. Ensure quality by designing a clear and efficient ontology that makes it easier for you to organize the output.

Sample questions to consider include:

  • Does the ontology need to be extremely complex, or can you consider separate projects to label different things on the same data?
  • Is the free text field necessary? Avoid options that can lead to inconsistencies, typos and misspellings.
  • Is it best to skip, or should there be an annotation denoting that the answer is unknown or that there was nothing to label? Do you want to understand why some assets do not meet the criteria, or is it fine to have a bucket of unlabeled assets?

Creating labeling instructions

Once all the details of the task have been properly defined, you have to provide labeling instructions to the workforce. Even with an extremely simple ontology, it is necessary to offer additional information about the labeling task.

Labeling instructions complement the ontology and can take the form of a document or a video demo. You can include anything that you deem useful and relevant to explaining the rules of your labeling task in a way that is easy for labelers to follow.

Good instructions will go into detail with specific examples and clearly lay out major labeling rules. Labeling instructions should provide context to the task, explain what the task entails, describe the labeling steps, and serve as a "living document".

Instructions can be altered depending on task progress and any changes in your requirements. Since changes can be made, we advise keeping track of what made a label meet your success criteria so you can tailor the instructions to help the team understand what makes a "good label".
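
If your instructions live in a PDF or HTML file, they can also be attached to the project programmatically. The sketch below assumes the Labelbox Python SDK's upsert_instructions helper and uses a placeholder local file path and project ID.

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_ID")

# Attach (or replace) the labeling instructions shown to labelers in the editor.
# The SDK accepts a local PDF or HTML file path.
project.upsert_instructions("./labeling_instructions.pdf")
```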


Defining labeling rules

Definition of all ontology items

It is important that you make sure to list and define the items that you want labeled. As mentioned in the "Creating an ontology" section above, you should be clear on the features and rules behind each expected annotation:

  • Objects
  • Classifications
  • Relationships

For example, if the project contains several entities/objects to be labeled with multiple classifications to choose from, explain each entity/object and each classification in sufficient detail:

  • How tight around the object does the bounding box need to be?
  • How do labelers pinpoint a precise point on a blurry image with shapes that are not sharp?
  • Is there a maximum number of points a polygon can have before it becomes excessive for your model?

When defining your labeling rules, you should aim to:

  • Provide clear definitions of all concepts for text projects to prevent ambiguities in the labeling
  • Try to format your instructions in a way that is easy to read, and make sure the golden rules pop out in a clear and distinguishable manner to minimize any chance that these could be missed
  • Define your approach on a frame basis for video projects

Step-by-step labeling workflow

Make sure to describe each step in the labeling workflow so that your labelers are not lost in the ontology:

  • For projects with multiple objects per asset, is there a specific order in which you need the annotations to be added?
  • Describe your expectations for the review process

Examples

The best way to convey the results you want is by providing clear examples of the data to the labelers in the form of screenshots in your instructions. There are several approaches to this:

  • Provide screenshots of unlabeled and labeled data for the labelers to have an overview of the assets they will be working on and what the outcome should be. Try to include several images that represent the variations of the full set.
  • Clarify the variability of the data by sharing “edge cases” in your guidelines. An asset that would stand out from the rest of the set should be explained in detail, so the labelers know how to approach these types of cases.
  • Include incorrectly labeled (negative) examples as well. This helps the team identify mistakes to avoid, and mentioning common mistakes up front helps prevent them in your task.

All of these key components contribute to defining a task that is set for success on the Labelbox platform. These aspects also help the Labeling Team Manager ensure that your project outcomes are successful.

After the initial task setup, the next step in your labeling journey is to define your success criteria. Along with your volumes and deadlines, you'll learn how to estimate the average time per label, describe how a "good label" is measured, and learn about SLAs in our next guide: How to define your data labeling project's success criteria.