How to prepare and submit a batch for labeling
High-quality training data is crucial to the success of any ML project. To improve model performance, teams need to not only queue data for labeling but also prioritize specific data, which enables faster iteration cycles and lowers labeling cost.
Labelbox’s Catalog is a single place for teams to upload their datasets, quickly visualize their data, and make informed decisions on what data to prioritize for labeling.
What are Batches?
While Batches will replace dataset-based queueing, datasets in Labelbox are not going away.
Teams still upload their data as datasets. From there, they can filter Data Rows and add them to a labeling project in groups called batches. A batch is a collection of Data Rows from Catalog that can be queued and added to your labeling project.
Once your dataset is in Catalog, you can filter and sort Data Rows to determine which ones you’d like to add to your project as a batch. If you want to send a whole dataset to your queue, you can simply select all Data Rows within the dataset and send them as a batch.
Why is Batch-based queueing so powerful?
Machine learning teams often have large amounts of unlabeled data, and labeling all of it can be incredibly time-consuming and expensive. An important question becomes how to decide what data to label and prioritize in order to accelerate model development.
Rather than queueing an entire dataset for labeling, queueing Data Rows with batches gives teams greater control and flexibility in the prioritization of a project’s labeling queue.
When creating and adding a batch to a new Benchmark or Consensus project, you’ll need to name the batch and set its priority from 1 to 5 (with 1 being the highest). If you’re creating a batch for a Consensus project, you can enable or disable Consensus for that batch, set the coverage percentage, and set the number of labels.
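To make the priority rule concrete, here is a minimal Python sketch that validates a batch's name and priority and orders batches with priority 1 first. `BatchSpec` and `queue_order` are hypothetical names used only for illustration; they are not part of Labelbox, and the ordering is a simplified model of how a priority-based queue might behave.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BatchSpec:
    """Hypothetical container for the settings chosen when creating a batch."""

    name: str
    priority: int = 5  # 1 is the highest priority, 5 the lowest

    def __post_init__(self) -> None:
        if not self.name.strip():
            raise ValueError("a batch needs a non-empty name")
        if not 1 <= self.priority <= 5:
            raise ValueError("priority must be between 1 and 5")


def queue_order(specs: List[BatchSpec]) -> List[str]:
    """Return batch names sorted so that priority 1 comes first.

    sorted() is stable, so batches with equal priority keep their input order.
    """
    return [spec.name for spec in sorted(specs, key=lambda spec: spec.priority)]
```

For example, a priority-1 batch of edge cases would be listed ahead of a routine priority-5 batch, regardless of the order in which the two were created.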
With no limit to how many batches you can add to your labeling project, teams have ultimate flexibility to prioritize certain edge cases and create ad-hoc labeling tasks without needing to modify entire datasets every time.
Learn more about batch-based queueing in our documentation.
How to create and queue a batch for labeling
When you create a new project in Labelbox, you'll be prompted to queue your data for labeling through a batch.
How to create & queue a batch for labeling (Benchmark)
How to create & queue a batch for labeling (Consensus)
Learn more about how to create a batch in our documentation.
Manage and view batches within a project in the Data Rows tab
You can add, view, and manage batches directly in the Data Rows tab, making it easy for teams to have a holistic view of their entire labeling operations.
How to add, view & manage batches for a Benchmark project
How to add, view & manage batches for a Consensus project
How to delete a batch
How to do this through our SDK:
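A minimal sketch of deleting a batch by name. It assumes the Labelbox Python SDK exposes `project.batches()` for listing a project's batches and `batch.delete()` for removing one; treat both as assumptions and confirm them against the SDK reference for your version. The helper name `delete_batch_by_name` is hypothetical.

```python
def delete_batch_by_name(project, batch_name: str) -> bool:
    """Delete the first batch in `project` whose name matches.

    Assumes `project.batches()` yields objects that have a `name` attribute
    and a `delete()` method. Returns True if a batch was found and deleted.
    """
    for batch in project.batches():
        if batch.name == batch_name:
            batch.delete()
            return True
    return False


# Assumed usage with the Labelbox Python SDK (placeholders, not real values):
#   import labelbox as lb
#   client = lb.Client(api_key="<YOUR_API_KEY>")
#   project = client.get_project("<PROJECT_ID>")
#   delete_batch_by_name(project, "my-batch")
```

The function only relies on duck typing, so it can be exercised with stand-in objects before pointing it at a real project.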
What unique workflows can Batches unlock?
Sample your data to make data selection faster
With larger datasets, you may want to sample data in order to make the data selection process faster and more efficient. In Labelbox, you can either randomly sample a number of Data Rows or choose to do an ordered sample.
You can learn more about sampling methods in our documentation.
Random sampling
Random sampling can help reduce selection bias when teams are deciding which Data Rows they want to send for labeling.
- Click “Sample” at the top-right of Catalog
- Choose “Random” from the top dropdown
- Specify the number of Data Rows that you wish to sample
- Select the project, name your batch, and set the batch’s priority
- Click “Submit”
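The same idea can be sketched outside the UI. The snippet below draws a uniform random sample of Data Row IDs without replacement using Python's standard library; `random_sample` is a hypothetical helper for illustration, not a Labelbox function.

```python
import random
from typing import List, Optional


def random_sample(data_row_ids: List[str], n: int, seed: Optional[int] = None) -> List[str]:
    """Pick up to `n` distinct IDs uniformly at random.

    Passing a seed makes the selection reproducible, which helps when a
    sampled batch needs to be recreated later.
    """
    rng = random.Random(seed)
    return rng.sample(data_row_ids, min(n, len(data_row_ids)))
```

Because every Data Row has the same chance of being picked, the sample is less likely to reflect the habits of whoever is choosing the data.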
Ordered sampling
Ordered sampling can be helpful if you want to quickly sample and send a set number of Data Rows to a project.
- Click “Sample” at the top-right of Catalog
- Choose “Ordered” from the top dropdown
- Specify the number of Data Rows that you wish to sample
- Select the project, name your batch, and set the batch’s priority
- Click “Submit”
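As a rough illustration, ordered sampling can be thought of as taking the first N Data Rows under some sort order. That interpretation, the `created_at` field, and the `ordered_sample` helper are all assumptions made for this sketch.

```python
from typing import Dict, List


def ordered_sample(data_rows: List[Dict], n: int, sort_key: str = "created_at") -> List[Dict]:
    """Return the first `n` rows after sorting by `sort_key` (ascending)."""
    return sorted(data_rows, key=lambda row: row[sort_key])[:n]
```

Unlike a random sample, the result is fully determined by the sort order, so the same inputs always produce the same batch.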
Better find and fix labeling or model errors
Your model is only as good as the data that it gets trained on – high-quality labels are crucial for training your model.
You can use Labelbox Model to surface Data Rows where your model predictions and ground truth labels disagree and view them in Catalog. Once in Catalog, you can send that subset of poorly labeled Data Rows to your labeling project. You can mark the batch as “high-priority”, since fixing these errors directly improves the data your model is trained on.
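The core of this workflow is a comparison between predictions and ground truth. Here is a minimal sketch for a single-label classification case, assuming both are available as plain dictionaries keyed by Data Row ID; `find_disagreements` is a hypothetical helper, and real annotation types such as boxes or masks would need a richer comparison.

```python
from typing import Dict, List


def find_disagreements(ground_truth: Dict[str, str], predictions: Dict[str, str]) -> List[str]:
    """Return the IDs of rows whose predicted class differs from the label.

    Rows without a prediction are skipped because there is nothing to compare.
    """
    return sorted(
        row_id
        for row_id, label in ground_truth.items()
        if row_id in predictions and predictions[row_id] != label
    )
```

The returned IDs are the candidates to review and, where the label turns out to be wrong, to send back for relabeling as a high-priority batch.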
Learn more about finding and fixing label and model errors in the guides below:
Prioritize specific data to improve labeling accuracy & model performance
Not all data impacts model performance equally. In a sea of unlabeled data, teams need to decide which data to label first.
If you notice that your labelers are struggling with a specific image or object annotation, you can find similar data and send it to your labeling project as a batch to improve labeling accuracy.
If you notice that your model is struggling with a specific class, you can send more similar data to be labeled so that your model performs better on that class.
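One common way to find similar data is to rank candidates by the cosine similarity of their embeddings to a known hard example. The sketch below shows that technique in plain Python. It is an illustration only and does not describe how Catalog implements similarity search; the function names and the embedding dictionaries are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(anchor: Sequence[float], candidates: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the IDs of the `k` candidates closest to `anchor`, best match first."""
    ranked = sorted(
        candidates,
        key=lambda row_id: cosine_similarity(anchor, candidates[row_id]),
        reverse=True,
    )
    return ranked[:k]
```

The IDs returned for a difficult example could then be grouped into a batch and queued with a high priority.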
FAQ on Batches
To add a batch to a project, am I still required to upload a dataset to Labelbox?
Yes – you’ll still need to upload data as a dataset in Labelbox. After you upload a dataset, the dataset will live in Catalog.
From there, you can filter and choose to send specific Data Rows to a project as a batch. You can also send the entire dataset as a batch to a project if needed.
How many Data Rows can you have in a batch?
You can now have up to 100k Data Rows in a given batch. There is also no limit to how many batches you can add to a project, so teams can add all necessary Data Rows to a project for labeling.
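If a selection exceeds the per-batch limit, it can be split into several batches. Here is a minimal sketch, using the 100k figure stated above as the default chunk size; `split_into_batches` is a hypothetical helper.

```python
from typing import List

MAX_BATCH_SIZE = 100_000  # per-batch limit stated above


def split_into_batches(data_row_ids: List[str], max_size: int = MAX_BATCH_SIZE) -> List[List[str]]:
    """Split a list of IDs into consecutive chunks of at most `max_size` items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [data_row_ids[i:i + max_size] for i in range(0, len(data_row_ids), max_size)]
```

A selection of 250,000 Data Rows would come out as three chunks of 100,000, 100,000, and 50,000, each of which can be submitted as its own batch.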
Can I submit the same Data Row to a project multiple times?
A Data Row cannot be part of more than one batch in a project at a time. This helps teams avoid duplicate Data Rows within a project.
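When preparing a new batch, it can help to filter out Data Rows that already belong to a batch in the same project. Here is a minimal sketch, assuming you track the project's existing batches as a mapping from batch name to a set of Data Row IDs; `rows_not_yet_queued` is a hypothetical helper.

```python
from typing import Dict, List, Set


def rows_not_yet_queued(candidate_ids: List[str], existing_batches: Dict[str, Set[str]]) -> List[str]:
    """Return candidate IDs that are in no existing batch, in their original order.

    Duplicates inside `candidate_ids` are also collapsed to one occurrence.
    """
    already_queued: Set[str] = set().union(*existing_batches.values())
    unique_candidates = dict.fromkeys(candidate_ids)  # keeps first-seen order
    return [row_id for row_id in unique_candidates if row_id not in already_queued]
```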
Can a batch be shared between multiple projects?
A batch cannot be shared between multiple projects. However, if you wish to use the same set of Data Rows for another project, you can create a new batch using the same Data Rows.
How do I set Consensus at the batch level?
When creating your project, you choose a quality mode: Benchmarks or Consensus.
If you’ve selected Consensus for your project, you can configure Consensus for batches that you add to your project.
When creating a batch for a Consensus project, you’ll see 3 configuration settings:
- A toggle to enable / disable Consensus for that batch
- A slider to set the coverage percentage
- A place to enter the number of labels
The coverage slider and the number-of-labels field replace the old Labeling Parameter Overrides (LPO) feature, since they let you customize the assets in the queue at the batch level.
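To get a feel for how these three settings affect labeling volume, the sketch below estimates the total number of labels a batch would require. It assumes that the coverage percentage is the share of Data Rows in the batch that receive multiple labels and that the number of labels applies to each of those rows; both the interpretation and the `estimated_label_count` helper are assumptions for illustration.

```python
def estimated_label_count(
    num_rows: int,
    coverage_percentage: float,
    number_of_labels: int,
    consensus_enabled: bool = True,
) -> int:
    """Estimate how many labels a batch needs under the assumed Consensus semantics."""
    if not consensus_enabled:
        return num_rows  # every row is labeled once
    if not 0 <= coverage_percentage <= 100:
        raise ValueError("coverage_percentage must be between 0 and 100")
    if number_of_labels < 1:
        raise ValueError("number_of_labels must be at least 1")
    consensus_rows = round(num_rows * coverage_percentage / 100)
    single_label_rows = num_rows - consensus_rows
    return single_label_rows + consensus_rows * number_of_labels
```

Under these assumptions, a batch of 1,000 Data Rows with 10% coverage and 3 labels would need 900 single labels plus 300 consensus labels, or 1,200 labels in total.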
Can I enable Benchmarks AND Consensus on a project?
Having both Benchmarks and Consensus batches on a project is not yet supported.
I don’t want to use Batches for my project right now. What should I do?
Batches can enable data-centric iterations driven by prioritizing high-impact data. If you don’t want to add specific Data Rows to a project as a batch, you can add your entire dataset as a single batch for labeling (up to 100k Data Rows).