John Thomas, November 4, 2020

Tutorial: Use model-assisted labeling to improve speed and accuracy

Model-assisted labeling uses your own model to accelerate labeling, improve accuracy, and help you deliver performant ML models at a lower cost. Labelbox is designed to quickly and easily integrate your model into labeling workflows, and we’ve created the tutorial below to walk you through how to get started.

This tutorial follows an end-to-end Google Colab notebook we compiled to explore how to train a basic segmentation model with PyTorch and apply the model’s predictions to the remaining data rows in Labelbox.

You can find the Colab notebook here.

Note: The code snippets in this article and the notebook are simplified to illustrate concepts rather than written with best practices for large scale production.

Getting started

To follow along with this tutorial, you should have two things:

  1. Either (A) a model that can generate predictions on your data, or (B) a project in Labelbox with an initial number of labels. For segmentation models this can be as low as 300-400 labels; for object detection models it should be around 1,000.
  2. A project with unlabeled data rows in Labelbox that you want to apply model pre-labels to.

For this example, we’ll be working with segmentation data, but the notebook contains methods for object detection as well.

Setup and Training a Model

If you don’t yet have a model, you’ll want to train a basic one to generate your predictions. If you do, all that’s necessary is setting up your Labelbox client and fetching your labels.

import json
import labelbox as lb

# LB_API_KEY is your Labelbox API key; get_ontology and get_labels
# are helper functions defined in the Colab notebook.
PROJECT_ID = "ck7wos1ri5o9f0a00jb1oyqgc"
client = lb.Client(LB_API_KEY)
project = client.get_project(PROJECT_ID)
ontology, thing_classes = get_ontology(PROJECT_ID)
labels = json.loads(get_labels(PROJECT_ID))

# Hold out a fraction of the labels (VALIDATION_RATIO, e.g. 0.2) for validation.
split = int(VALIDATION_RATIO * len(labels))
val_labels = labels[:split]
train_labels = labels[split:]

We start by getting our data from Labelbox and splitting it into a training set and a validation set. From there, the Colab notebook downloads all the relevant images and arranges them in local files for training with Facebook’s Detectron2 model.
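One caveat on the split above: if the exported labels arrive in a meaningful order (by dataset, by date labeled, and so on), slicing the list directly can bias the validation set. A small safeguard, not in the original notebook, is to shuffle before slicing; here is a minimal sketch using a stand-in label list:

```python
import random

labels = list(range(10))  # stand-in for the exported label list

# Shuffle with a fixed seed so the split is random but reproducible.
random.seed(42)
random.shuffle(labels)

# Same hold-out slicing as in the snippet above, now over shuffled labels.
VALIDATION_RATIO = 0.2
split = int(VALIDATION_RATIO * len(labels))
val_labels = labels[:split]
train_labels = labels[split:]
```

Seeding the shuffle keeps the train/validation split stable across notebook restarts, which matters when you compare model checkpoints.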

The full code for training the model is in the Colab notebook; for more detail, see the sections on preparing the datasets, training the model, and creating a predictor.

We recommend using your developing model for this step, in whatever stage of development it may be, since (beyond the very early stages) no off-the-shelf model should outperform your own on your use case.

Preparing data for inference

When you have a model that can generate predictions, the first step is to gather the data that needs predictions generated on it. In Labelbox, that’s everything in the dataset that hasn’t been labeled yet. To compute this, we first pull all the data row IDs in the project:

all_datarow_ids = []
all_datarows = []
for dataset_id in DATASETS:
   dataset = client.get_dataset(dataset_id)
   for data_row in dataset.data_rows():
       all_datarow_ids.append(data_row.uid)
       all_datarows.append(data_row)

Then we find all the data row IDs we trained on:

datarow_ids_with_labels = []
for label in labels:
   datarow_ids_with_labels.append(label['DataRow ID'])

And get the difference of these two:

datarow_ids_queued = diff_lists(all_datarow_ids, datarow_ids_with_labels)
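diff_lists is a helper defined in the notebook; a minimal stand-in (my sketch, using set subtraction) looks like this:

```python
def diff_lists(superset, subset):
    """Return the items in `superset` that are not in `subset`.

    Minimal stand-in for the notebook's diff_lists helper. Converting to
    sets makes each membership check O(1); note this drops duplicates and
    ordering, which is fine for unique data row IDs.
    """
    return list(set(superset) - set(subset))

queued = diff_lists(['a', 'b', 'c', 'd'], ['b', 'd'])
# queued contains 'a' and 'c', in arbitrary order
```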

Generate and Upload predictions

At this point, all that’s needed is to build a list of the annotations to import:

import os
import cv2
from uuid import uuid4

# DataRow objects corresponding to the queued IDs computed above
datarows_queued = [dr for dr in all_datarows if dr.uid in set(datarow_ids_queued)]

predictions = []
for datarow in datarows_queued:
   extension = os.path.splitext(datarow.external_id)[1]
   filename = DATA_LOCATION + 'inference/' + datarow.uid + extension
   im = cv2.imread(filename)
   # Predict using the FB Detectron2 predictor
   outputs = predictor(im)
   categories = outputs["instances"].to("cpu").pred_classes.numpy()
   predicted_boxes = outputs["instances"].to("cpu").pred_boxes  # used in the object detection path
   for i in range(len(categories)):
       # Map the predicted class name back to its ontology schema ID
       classname = thing_classes[categories[i]]
       for item in ontology:
           if classname == item['name']:
               schema_id = item['featureSchemaId']
       pred_mask = outputs["instances"][i].to("cpu").pred_masks.numpy()
       # mask_to_cloud (defined in the notebook) uploads the mask image
       # and returns a public URL for it
       cloud_mask = mask_to_cloud(im, pred_mask, datarow.uid)
       predictions.append({
           'uuid': str(uuid4()),
           'schemaId': schema_id,
           'dataRow': {'id': datarow.uid},
           'mask': {
               'instanceURI': cloud_mask,
               'colorRGB': [255, 255, 255],
           },
       })
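mask_to_cloud is defined in the notebook; conceptually, it turns a boolean instance mask into the white-on-black image Labelbox reads from instanceURI, then uploads it to cloud storage. A hypothetical sketch of just the conversion step (the encoding and upload are omitted):

```python
import numpy as np

def mask_to_rgb(pred_mask):
    """Convert a boolean (H, W) instance mask into a white-on-black
    uint8 RGB array.

    Illustrative sketch only: the notebook's mask_to_cloud additionally
    encodes this array as a PNG and uploads it, returning a public URL.
    """
    h, w = pred_mask.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    rgb[pred_mask] = 255  # mask pixels become white
    return rgb

demo = mask_to_rgb(np.array([[True, False], [False, True]]))
# demo[0, 0] is [255, 255, 255]; demo[0, 1] is [0, 0, 0]
```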

And then import them into Labelbox:

from datetime import datetime

now = datetime.now()  # current date and time
job_name = 'pre-labeling-' + now.strftime("%m-%d-%Y-%H-%M-%S")
upload_job = project.upload_annotations(
    name=job_name,
    annotations=predictions
)
upload_job.wait_until_done()
assert (
   upload_job.state == BulkImportRequestState.FINISHED or
   upload_job.state == BulkImportRequestState.FAILED
)
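upload_annotations kicks off an asynchronous bulk import job: it runs in the background and eventually settles into a FINISHED or FAILED state, so you poll before checking the result. The underlying pattern is simple state polling; a generic sketch (illustrative names, not the SDK’s actual internals):

```python
import time

def wait_for_terminal_state(get_state, terminal_states, timeout=600, interval=2.0):
    """Poll get_state() until it returns a terminal state or the timeout
    expires. Illustrative only: the Labelbox SDK's wait_until_done()
    handles this for you."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = get_state()
        if state in terminal_states:
            return state
        time.sleep(interval)
    raise TimeoutError("annotation import did not finish in time")

# Usage with a fake job that finishes on the third poll:
states = iter(["RUNNING", "RUNNING", "FINISHED"])
result = wait_for_terminal_state(lambda: next(states),
                                 {"FINISHED", "FAILED"}, interval=0.01)
# result == "FINISHED"
```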

Labeling with Imported Annotations

This basic method of pre-labeling can dramatically reduce labeling time and associated costs while helping you make significant improvements to your model performance. As your model improves, you can develop more sophisticated workflows, such as using our queue customization tools to have your labelers see the most impactful images first, or to get multiple datapoints on corner and edge cases.

Learn more about model-assisted labeling, or contact us to schedule a demo.