Programmatically launch human data jobs for RLHF and evaluation

Introduction

Labelbox’s Python SDK provides AI teams with a powerful approach to orchestrate human data labeling projects. In this guide, we’ll walk through how to harness the Python SDK to manage human data labeling jobs for RLHF and evaluation. With just a few steps, you can set up the SDK, import various types of data, and launch, monitor, and export labeling projects programmatically, all while ensuring data quality and scalability.

Getting started: Set up the Labelbox Python SDK

Let's begin by setting up the Labelbox Python SDK in four simple steps:

1) Create an API key to start using the Labelbox Python SDK

2) Run pip install "labelbox[data]" in your terminal, or !pip install "labelbox[data]" in a notebook

3) Authenticate by saving your API key to an environment variable:

user@machine:~$ export LABELBOX_API_KEY="<your_api_key>"
user@machine:~$ python3

4) Then, import and initialize the API Client.

import labelbox as lb 
client = lb.Client()

Importing your data into Labelbox: Methods and supported formats

Now that the SDK is set up, let's look at an example of uploading LLM response evaluation data for RLHF:

import uuid

# Create a dataset
dataset = client.create_dataset(
    name="RLHF asset upload example" + str(uuid.uuid4()),
    iam_integration=None
)
# Upload assets
task = dataset.create_data_rows([
    {
      "row_data": "https://storage.googleapis.com/labelbox-datasets/conversational-sample-data/pairwise_shopping_1.json",
      "global_key": str(uuid.uuid4())
    },
    {
        "row_data": "https://storage.googleapis.com/labelbox-datasets/conversational-sample-data/pairwise_shopping_2.json",
        "global_key": str(uuid.uuid4())
    },
    {
        "row_data": "https://storage.googleapis.com/labelbox-datasets/conversational-sample-data/pairwise_shopping_3.json",
        "global_key": str(uuid.uuid4())
    }
  ])
task.wait_till_done()
print("Errors:", task.errors)
print("Failed data rows:", task.failed_data_rows)

Learn more about all supported data types and editors in the Labelbox documentation.
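The pairwise files referenced above follow Labelbox's conversational text format. As a rough, illustrative sketch only (the field names below are simplified assumptions, not the authoritative schema; consult the sample JSON files above for the exact structure), a pairwise row pairs a conversation with two candidate model responses:

```python
import json

# Illustrative sketch of a pairwise conversational data row.
# Field names are assumptions for illustration; see the hosted sample
# files above for Labelbox's exact schema.
pairwise_row = {
    "messages": [
        {
            "messageId": "message-0",
            "user": {"userId": "customer", "align": "left"},
            "content": "Can you recommend a waterproof hiking boot?",
        },
    ],
    "modelOutputs": [
        {"title": "Response A", "content": "The TrailMax 2 is fully waterproof..."},
        {"title": "Response B", "content": "You might prefer the AquaStep Pro..."},
    ],
}

# Serialize as you would before hosting the file at a public or signed URL
print(json.dumps(pairwise_row, indent=2)[:60])
```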

Creating an ontology using the SDK 

With the data imported, the next step is to create your ontology for the project. The ontology defines the structure and relationships within the data for your labeling process. Below is an example of how to create an ontology using the Labelbox Python SDK:

import labelbox as lb

ontology_builder = lb.OntologyBuilder(
    classifications=[
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            scope=lb.Classification.Scope.INDEX,
            name="Free form text example"),
        lb.Classification(
            class_type=lb.Classification.Type.CHECKLIST,
            scope=lb.Classification.Scope.INDEX,
            name="Checklist example",
            options=[
                lb.Option(value="first_checklist_answer"),
                lb.Option(value="second_checklist_answer")
            ]),
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            scope=lb.Classification.Scope.INDEX,
            name="Radio example",
            options=[
                lb.Option(value="first_radio_answer"),
                lb.Option(value="second_radio_answer")
            ]),
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="Rank #1",  # More ranks can be created like this, with N options each
            required=True,
            options=[
                lb.Option(value="Option 1"),
                lb.Option(value="Option 2"),
                lb.Option(value="Option 3"),
            ])
    ]
)


ontology = client.create_ontology(
    "RLHF classification example",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Conversational
)

For more examples of ontology creation, please refer to the documentation.
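The labeling-service and export steps later in this guide operate on a project object. As a minimal sketch of how one can be set up (method names such as connect_ontology and create_batch reflect recent SDK versions, and the project and batch names are arbitrary placeholders; verify against your SDK version):

```python
# Sketch: create a project, connect the ontology created above, and queue
# the uploaded data rows for labeling. Names are placeholders.
project = client.create_project(
    name="RLHF labeling project",
    media_type=lb.MediaType.Conversational
)

# Attach the ontology to the project
project.connect_ontology(ontology)

# Send the dataset's data rows to the project as a batch
batch = project.create_batch(
    name="rlhf-batch-1",
    data_rows=[data_row.uid for data_row in dataset.data_rows()]
)
```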

Best practices for ontology design

Leverage existing ontologies wisely

Labelbox allows users to reuse ontologies from previous projects, saving time and ensuring consistency across related tasks. However, be cautious when modifying shared ontologies:

  • Copy existing ontologies: To prevent unintended changes to previous projects, create a copy of an existing ontology. This creates a new schema node while retaining all your classes.
  • Customize the copy for the current project: After copying, users can freely modify the ontology to suit the new project's needs without affecting earlier work.

Optimize object ordering for logical workflows

The order of objects in the ontology can significantly impact the labeling process:

  • Prioritize common objects: Create the most frequently used objects first. They'll appear at the top of the list, making them easily accessible to labelers.
  • Design a logical flow: For complex tasks like model response comparisons, structure the ontology to guide labelers through a step-by-step analysis:
  1. Start with individual model evaluation criteria.
  2. Place comparative questions (e.g., "Which model response is best?") at the end.

This approach ensures labelers have thoroughly analyzed each option before making final comparisons.

Enhance visual clarity with color coding

Improve the visual experience for labelers:

  • Consistent color schemes: Assign and edit colors for each object in the ontology.
  • Maintain color consistency: Use the same colors throughout the project to reduce cognitive load and improve labeling speed and accuracy.

Provide easy access to labeling instructions

Make sure labelers have all the information they need at their fingertips:

  • Attach PDF instructions: Upload labeling guidelines as a PDF document.
  • Side-by-side viewing: Labelers can reference the instructions within Labelbox, displayed alongside the project for convenient access.

Use advanced classification features

Take advantage of Labelbox's classification capabilities to create more nuanced and accurate labels:

  • Implement nested classifications: This allows for more detailed object identification. For example, after drawing a segmentation mask over a tree, labelers can further classify it as healthy or unhealthy.
  • Set required questions: Ensure critical information is always captured by making certain questions mandatory for each asset.
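Both points above can be expressed directly in the ontology builder. As a hedged sketch in the style of the earlier builder code (the tool and classification names are illustrative placeholders, and the enum names reflect recent SDK versions), a segmentation tool can carry a nested, required classification mirroring the tree example:

```python
import labelbox as lb

# Sketch: a segmentation tool with a nested classification, mirroring the
# healthy/unhealthy tree example above. Names are illustrative placeholders.
tree_tool = lb.Tool(
    tool=lb.Tool.Type.RASTER_SEGMENTATION,
    name="Tree",
    classifications=[
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="Tree health",
            required=True,  # the follow-up question must be answered
            options=[
                lb.Option(value="healthy"),
                lb.Option(value="unhealthy")
            ])
    ]
)

ontology_builder = lb.OntologyBuilder(tools=[tree_tool])
```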

By following these best practices, users will create more efficient labeling jobs, leading to higher quality data and improved model performance.

Labelbox labeling services 

For Enterprise plan users, Labelbox offers data labeling services, connecting them with professional labelers to process large amounts of data quickly and efficiently. Key features include:

1) Rapid Data Processing: Quickly handle large volumes of data without the overhead of hiring additional staff.

2) Specialized Expertise: Access labelers with specialized knowledge, including:

      • Medical experts
      • Various language specialists
      • Other certified specialties

3) Flexibility: Scale your labeling service up or down based on project needs without long-term commitments.

4) Quality Assurance: Professional labelers are trained to maintain high standards of accuracy and consistency.

5) Time and Resource Savings: Eliminate the need for recruitment, training, and management of an in-house labeling team.

By leveraging labeling services, enterprise users can significantly accelerate their data labeling projects, especially when dealing with complex datasets or when requiring domain-specific expertise. This service complements Labelbox's robust data import and management capabilities, providing a comprehensive solution for large-scale AI and machine learning projects.

To get started, Labelbox provides programmatic methods to request labeling services, as shown here:

1) Getting labeling service information: Users can retrieve information about the labeling service for a specific project:

labeling_service = project.get_labeling_service()
print(labeling_service)

This will return details such as the service ID, project ID, creation date, status, and more.

2) Requesting labeling services for faster results: Once data and an ontology with instructions have been added, users can initiate a boost request:

labeling_service.request()

This call initiates the labeling service for your project.

3) Monitoring your labeling service’s status: The status of a requested labeling service can be easily monitored in the Labelbox UI.
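The status can also be checked in a script. Below is a minimal, generic polling helper; the status source is passed in as a callable, so it can wrap whatever your SDK version exposes (for example, re-fetching the labeling service and reading its status field, which is an assumption here, not a documented call):

```python
import time

def poll_status(get_status, done_states, interval_s=30.0, timeout_s=3600.0):
    """Call get_status() until it returns a value in done_states or times out.

    get_status is any zero-argument callable, e.g. one that re-fetches the
    labeling service and returns its status string (an assumption about the
    SDK surface; adapt it to what your version exposes).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status in done_states:
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"labeling service still not done after {timeout_s}s")

# Example with a stubbed status source that finishes on the third check
states = iter(["Requested", "InProgress", "Complete"])
final = poll_status(lambda: next(states), {"Complete"}, interval_s=0.0)
print(final)  # → Complete
```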

Simple export via the Labelbox SDK

Once the labeling project is complete, users can easily export the labels using the SDK, as shown below.

export_task = project.export(params=export_params, filters=filters)

This simple command allows users to retrieve labeled data that is ready for use in machine learning pipelines. Please refer to the documentation for flexible ways of exporting a project with filters.
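For reference, the export call accepts parameter and filter dictionaries. The keys below are a sketch based on common export options (treat them as assumptions to verify against the export documentation for your SDK version):

```python
# Sketch of export parameters and filters; verify key names against the
# export documentation for your SDK version.
export_params = {
    "attachments": True,
    "metadata_fields": True,
    "data_row_details": True,
    "project_details": True,
    "label_details": True,
    "performance_details": True,
}
filters = {
    # Restrict the export to data rows with recent activity
    "last_activity_at": ["2024-01-01 00:00:00", "2024-12-31 23:59:59"],
}

# export_task = project.export(params=export_params, filters=filters)
# export_task.wait_till_done()
```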

Conclusion

The Labelbox Python SDK offers teams a convenient and powerful way to programmatically manage human data labeling projects. By providing control over every aspect of the labeling process - from data import and ontology design to project monitoring and data export - the SDK enables AI teams to incorporate high-quality labeled data into their workflows seamlessly.

We hope you found this guide helpful for gaining a deeper understanding of how to capitalize on an SDK-driven approach to simplify complex tasks and enhance productivity. Whether you’re working on small-scale projects or large, distributed labeling efforts, the Labelbox SDK offers the full suite of tooling needed to efficiently manage your data labeling needs and accelerate your AI development process.

If you're interested in implementing an SDK approach to jumpstart your human data jobs for RLHF and model evaluation, sign up for a free Labelbox account to try it out, or contact us to learn more.