
Labelbox | October 26, 2023

10X faster uploads: Labelbox’s data ingestion upgrades and how to use them

The importance of large-scale data uploads

The rapid adoption of AI, fueled by advancements in foundation models, LLMs, and generative AI, has led to an explosion in data needs. As companies now routinely deal with hundreds of millions or even billions of data points — images, videos, text, PDFs — the ability to effectively manage, curate, and label data has become critical. 

Advanced AI teams typically start by consolidating all of their data into a single, powerful platform that enables effective data exploration, visualization, and management from the get-go. A robust data platform is also essential as models go into production — to capture and analyze production data in order to evaluate model performance, find challenging slices of data, and identify new labels for continuous model improvement. 

Labelbox is the ideal data platform to meet these growing data and AI needs. And at the root of it all is fast, reliable, and scalable data ingestion. Behind the scenes, we’ve invested significant effort into supercharging our data ingestion capabilities, resulting in 10X faster performance. In this post, we’ll give an overview of these new ingestion pipelines and share best practices for large-scale data uploads into Labelbox.

Get your largest datasets into Labelbox in seconds

We’ve been hard at work behind the scenes to completely overhaul our data ingestion pipelines. The results? Fast, ultra-reliable data uploads capable of handling large data volumes.

Here are just a few advantages that our new pipelines unlock for your data: 

Blazingly fast uploads and processing: Get data, metadata, attachments, and embeddings into Labelbox 10x faster than before.

Limitless scale: Our new architecture can reliably handle data uploads of any size while maintaining performance.

Simplicity: Easily ingest text data into Labelbox directly as Python strings. Instead of uploading text as a file in a cloud bucket, you can pass your text strings straight to the Labelbox Python SDK. The same goes for text attachments and conversational text dictionaries (see the minimal sketch after this list).

Platform stability: The new architecture has fewer components, resulting in increased resilience and fewer errors.

No more data stuck processing: We’ve eliminated data rows getting stuck mid-processing, so you’ll no longer see data rows sitting in an endless “processing” state after an upload.

Programmatic processing waits: SDK users can now wait for data to finish processing before acting on it. Previously there was no programmatic way to do this, which was a common source of friction; with the new pipelines, simply call 'task.wait_till_done()'.
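
For example, here is a minimal sketch of the direct-ingestion path described above, assuming an authenticated client; the dataset name, text, and attachment value are illustrative, and the attachment uses the standard RAW_TEXT attachment type:

import labelbox as lb
from uuid import uuid4

client = lb.Client(api_key="<YOUR_API_KEY>")
dataset = client.create_dataset(name="direct-text-demo")  # illustrative dataset name

task = dataset.create_data_rows([{
  "row_data": "Labelbox can ingest raw text strings directly.",  # a plain Python string, no cloud bucket needed
  "global_key": str(uuid4()),
  "media_type": "TEXT",
  "attachments": [{"type": "RAW_TEXT", "value": "An inline text attachment."}],
}])
task.wait_till_done()  # block until the data row is fully processed
print(task.status)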

At Labelbox, our goal is to help you unleash the full potential of your data. These new ingestion capabilities help you move your AI initiatives forward faster than ever before.

Best practices for large-scale data uploads

Want to see the new data ingestion pipelines in action? Let’s walk through uploading 5 million images and 5 million text examples into Labelbox, using best practices for large data volumes. 

Best practice #1: Use the Labelbox Python SDK for large data uploads

When ingesting big datasets, manually uploading through the UI isn't ideal. For large volumes beyond a few thousand data points, we recommend programmatically uploading via the Labelbox Python SDK.

To get started:

  • Install the Labelbox Python SDK
  • Import the SDK and provide your Labelbox API key for authentication
!pip3 install -q "labelbox[data]"

import labelbox as lb
import labelbox.data.annotation_types as lb_types
from uuid import uuid4  # used below to generate unique global keys

client = lb.Client(api_key="<YOUR_API_KEY>")
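
To confirm that authentication works, you can optionally fetch your organization before uploading anything (a quick sanity check):

# optional: verify the API key by fetching the connected organization
print(client.get_organization())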

Best practice #2: Pick the right ingestion method

With Labelbox's flexibility, you can ingest data in multiple ways - whatever works best for your use case:

  • Upload raw data like text strings directly
  • Provide public URIs to assets on cloud buckets
  • Connect private URIs to assets on cloud storage buckets

For this 10-million-data-point example, we'll:

  • Ingest 5 million texts directly as raw strings
  • Provide 5 million public image URIs to ingest

The key is choosing the method that aligns with your specific data infrastructure and access needs. With support for direct assets, public URIs, and private cloud storage, Labelbox fits right into your existing workflows.
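
To make the three options concrete, here is a small sketch of what 'row_data' looks like in each case; the URLs and global keys are placeholders, and private-bucket assets additionally require a cloud storage integration (delegated access) configured in Labelbox so the assets can be read:

# the payload shape is the same for all three methods; only row_data differs
raw_text_row = {
  "row_data": "Any Python string can be ingested directly as text.",
  "global_key": "text-example-001",
}
public_uri_row = {
  "row_data": "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-1.jpg",
  "global_key": "image-public-001",
}
private_uri_row = {
  # placeholder URL: point row_data at your own object and configure delegated access in Labelbox
  "row_data": "https://storage.googleapis.com/<your-private-bucket>/<object-key>.jpg",
  "global_key": "image-private-001",
}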

Now let's put it into action by ingesting the text and image samples using the direct and public URI approaches.

First, we create 5 million text strings and 5 million image URLs.

texts = ["Here is my text. Labelbox upload is fast, reliable, and built for scale!"] * 5000000
images = ["https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-1.jpg"] * 5000000

We then create the corresponding Labelbox payload to upload texts and images.

payload_texts = []
for text in texts:
  payload = {
    "row_data": text,  # raw text string ingested directly
    "global_key": str(uuid4()),
    "media_type": "TEXT",
  }
  payload_texts.append(payload)

payload_images = []
for image in images:
  payload = {
    "row_data": image,  # public URI to the image asset
    "global_key": str(uuid4()),
    "media_type": "IMAGE",
  }
  payload_images.append(payload)

Best practice #3: Chunk large uploads 

When ingesting millions of data points, it’s best to break the upload into smaller, more manageable chunks.

We recommend chunk sizes around 150,000 data points when dealing with massive volumes. 

chunk_len = 150000

chunks_of_payload_texts = []
for i in range(0, len(payload_texts), chunk_len):
  chunks_of_payload_texts.append(payload_texts[i:i + chunk_len])

chunks_of_payload_images = []
for i in range(0, len(payload_images), chunk_len):
  chunks_of_payload_images.append(payload_images[i:i + chunk_len])

Sending manageable payloads ensures reliable and speedy ingestion. Labelbox's architecture easily handles the chunks in parallel. Chunking also makes ingestion more resilient. If an issue occurs mid-upload, you can resume where you left off instead of starting from scratch.
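
Chunking also enables a simple resume strategy. Here is one sketch (not a built-in Labelbox feature): record the index of the last successfully uploaded chunk in a local checkpoint file and skip past it on a re-run. It assumes text_dataset is a Labelbox Dataset created with client.create_dataset(), as shown in the next section, and the checkpoint file name is hypothetical:

import json
import os

CHECKPOINT_FILE = "upload_checkpoint.json"  # hypothetical local checkpoint file

# figure out which chunk to start from (0 on a fresh run)
start_index = 0
if os.path.exists(CHECKPOINT_FILE):
  with open(CHECKPOINT_FILE) as f:
    start_index = json.load(f)["next_chunk"]

for i in range(start_index, len(chunks_of_payload_texts)):
  task = text_dataset.create_data_rows(chunks_of_payload_texts[i])
  task.wait_till_done()
  if task.errors:
    raise RuntimeError(f"Chunk {i} failed: {task.errors}")
  # persist progress so a re-run resumes after the last successful chunk
  with open(CHECKPOINT_FILE, "w") as f:
    json.dump({"next_chunk": i + 1}, f)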

Best practice #4: Upload data asynchronously for faster processing

When ingesting large datasets, asynchronous uploading is a best practice if you don’t need to immediately act on the data. 

To upload asynchronously in Python:

  • Iterate through payload chunks
  • Upload each chunk with 'task = dataset.create_data_rows(chunk)'
  • Monitor status with 'task.status'
from time import time

# create a dataset to hold the text data rows (the name is illustrative)
text_dataset = client.create_dataset(name="large-scale-text-upload")

# upload texts, chunk by chunk
start = time()
for chunk in chunks_of_payload_texts:
  # upload the chunk asynchronously
  task = text_dataset.create_data_rows(chunk)
  # monitor task status
  print(task.status)
print(f"All chunks submitted in {time() - start:.1f} seconds")

Using asynchronous uploads saved significant time here: all 10 million examples were ingested in just 25 minutes, with Labelbox efficiently handling the data chunks behind the scenes.

Best practice #5: Wait until data is processed before acting 

If you need to act on uploaded data immediately, you can wait for processing to finish before taking further actions in the same Python notebook or script.

You can use 'task.wait_till_done()' after each chunk to pause execution until data is ingested and processed:

# upload texts, chunk by chunk
for chunk in chunks_of_payload_texts:
  # upload the chunk
  task = text_dataset.create_data_rows(chunk)
  # wait for data to be uploaded and processed
  task.wait_till_done()
  # monitor task status
  print(task.status)
  print(task.errors)
  print(task.result)
  # take action on the newly uploaded data

Now you can immediately take post-upload actions on the newly ingested data rows, such as sending them to a labeling project.
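
As one example, here is a sketch of queuing a small sample of the freshly uploaded data rows for labeling by creating a batch from their global keys. It assumes an existing project and that your installed SDK version supports creating batches from global keys; the project ID and batch name are placeholders:

# placeholder: use your own project ID
project = client.get_project("<YOUR_PROJECT_ID>")

# take a small sample of the global keys we generated for the uploaded texts
sample_global_keys = [row["global_key"] for row in chunks_of_payload_texts[0]][:1000]

# queue the freshly ingested data rows for labeling
batch = project.create_batch(
  name="text-upload-batch-1",  # illustrative batch name
  global_keys=sample_global_keys,
  priority=5,
)
print(batch)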

A robust data platform with fast, scalable ingestion is essential for organizations to effectively manage, explore, and leverage large datasets to train high-performing AI models. Labelbox gives teams a platform to unleash the full potential of their data, with ingestion capabilities built to power large-scale AI initiatives.
