Labelbox • October 26, 2023
10X faster uploads: Labelbox’s data ingestion upgrades and how to use them
The importance of large-scale data uploads
The rapid adoption of AI, fueled by advancements in foundation models, LLMs, and generative AI, has led to an explosion in data needs. As companies now routinely deal with hundreds of millions or even billions of data points — images, videos, text, PDFs — the ability to effectively manage, curate, and label data has become critical.
Advanced AI teams typically start by consolidating all of their data into a single, powerful platform that enables effective data exploration, visualization, and management from the get-go. A robust data platform is also essential as models go into production — to capture and analyze production data in order to evaluate model performance, find challenging slices of data, and identify new labels for continuous model improvement.
Labelbox is the ideal data platform to meet the growing demand for data and AI needs. And at the root of it all is fast, reliable, and scalable data ingestion. Behind the scenes, we’ve invested significant effort into supercharging our data ingestion capabilities, resulting in 10X faster performance. In this post, we’ll give an overview of these new ingestion pipelines and share best practices for large-scale data ingestion uploads into Labelbox.
Get your largest datasets into Labelbox in seconds
We’ve been hard at work behind the scenes to completely overhaul our data ingestion pipelines. The results? Fast, ultra-reliable data uploads capable of handling large data volumes.
Here are just a few advantages that our new pipelines unlock for your data:
Blazingly fast uploads and processing: Get data, metadata, attachments, and embeddings into Labelbox 10x faster than before.
Limitless scale: Our new architecture can reliably handle data uploads of any size while maintaining performance.
Simplicity: Ingest text data into Labelbox directly as Python strings. Instead of uploading text as a file in a cloud bucket, you can pass your text strings straight to the Labelbox Python SDK. The same goes for text attachments and conversational text dictionaries (see the short sketch after this list).
Platform stability: The new architecture has fewer components, resulting in increased resilience and fewer errors.
No more data stuck processing: We’ve eliminated data rows getting stuck mid-processing – no more data rows sitting in an endless “processing” state after an upload.
Programmatic processing waits: SDK users can now wait for data to finish processing before acting on it. Previously, a common source of friction was that there was no programmatic way to wait for processing to complete. With the new pipelines, simply call 'task.wait_till_done()'.
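To make the simplicity and processing-wait points concrete, here is a minimal sketch of ingesting a single text data row directly from a Python string, assuming a client and dataset already exist; the dataset name, global key, and attachment text are illustrative placeholders, and the same pattern is used at scale in the best practices below.
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")
dataset = client.create_dataset(name="direct-text-demo")  # illustrative dataset name

# a text data row passed directly as a Python string, with an inline text attachment
task = dataset.create_data_rows([
    {
        "row_data": "Here is my text, passed directly as a Python string.",
        "global_key": "sample-text-001",  # illustrative global key
        "media_type": "TEXT",
        "attachments": [{"type": "RAW_TEXT", "value": "Extra context for the labeler"}],
    }
])
task.wait_till_done()  # block until ingestion and processing finish
print(task.status)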
At Labelbox, our goal is to help you unleash the full potential of your data. These new ingestion capabilities empower you to accelerate AI initiatives faster than ever before.
Best practices for large-scale data uploads
Want to see the new data ingestion pipelines in action? Let’s walk through uploading 5 million images and 5 million text examples into Labelbox, using best practices for large data volumes.
Best practice #1: Use the Labelbox Python SDK for large data uploads
When ingesting big datasets, manually uploading through the UI isn't ideal. For large volumes beyond a few thousand data points, we recommend programmatically uploading via the Labelbox Python SDK.
To get started:
- Install the Labelbox Python SDK
- Import the SDK and provide your Labelbox API key for authentication
!pip3 install -q "labelbox[data]"

import labelbox as lb
import labelbox.data.annotation_types as lb_types
from uuid import uuid4  # used to generate unique global keys below
from time import time   # used to time the uploads below

client = lb.Client(api_key="<YOUR_API_KEY>")
Best practice #2: Pick the right ingestion method
With Labelbox's flexibility, you can ingest data in multiple ways - whatever works best for your use case:
- Upload raw data like text strings directly
- Provide public URIs to assets on cloud buckets
- Connect private URIs to assets on cloud storage buckets
For this 10-million-item sample, we'll:
- Ingest 5 million texts directly as raw strings
- Provide 5 million public image URIs to ingest
The key is choosing the method that aligns with your specific data infrastructure and access needs. With support for direct assets, public URIs, and private cloud storage, Labelbox fits right into your existing workflows.
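As a quick illustration, the payload dictionaries below differ only in what row_data points to; the bucket paths are placeholders, and private URIs additionally require a cloud storage integration (delegated access) configured for your Labelbox organization.
# three ways to specify row_data in a data row payload (illustrative values)
text_row = {
    "row_data": "Raw text passed directly as a Python string",
    "media_type": "TEXT",
}
public_uri_row = {
    "row_data": "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-1.jpg",
    "media_type": "IMAGE",
}
# private URIs work the same way once a cloud storage integration is configured
private_uri_row = {
    "row_data": "gs://your-private-bucket/path/to/image.jpg",  # placeholder path
    "media_type": "IMAGE",
}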
Now let's put it into action by ingesting the text and image samples using the direct and public URI approaches.
We create 5 million text strings and 5 million image URLs.
texts = ["Here is my text. Labelbox upload is fast, reliable, and built for scale!"]*5000000
images = ["https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-1.jpg!"]*5000000
We then create the corresponding Labelbox payloads for uploading the texts and images.
payload_texts = []
for text in texts:
    payload = {
        "row_data": text,
        "global_key": str(uuid4()),
        "media_type": "TEXT",
    }
    payload_texts.append(payload)
payload_images = []
for image in images:
    payload = {
        "row_data": image,
        "global_key": str(uuid4()),
        "media_type": "IMAGE",
    }
    payload_images.append(payload)
Best practice #3: Chunk large uploads
When ingesting millions of data points, it’s best to break the upload into smaller, more manageable chunks.
We recommend chunk sizes around 150,000 data points when dealing with massive volumes.
chunk_len = 150000

chunks_of_payload_texts = []
for i in range(0, len(payload_texts), chunk_len):
    chunks_of_payload_texts.append(payload_texts[i:i + chunk_len])

chunks_of_payload_images = []
for i in range(0, len(payload_images), chunk_len):
    chunks_of_payload_images.append(payload_images[i:i + chunk_len])
Sending manageable payloads ensures reliable and speedy ingestion. Labelbox's architecture easily handles the chunks in parallel. Chunking also makes ingestion more resilient. If an issue occurs mid-upload, you can resume where you left off instead of starting from scratch.
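One simple way to take advantage of that resilience is to track the index of the last successfully submitted chunk and skip past it on a retry. This is an illustrative pattern rather than a built-in SDK feature, and it assumes the text_dataset created in the upload loop below.
# resume an interrupted upload from the last successful chunk (illustrative pattern)
last_completed_chunk = -1  # e.g. read this from a checkpoint file you maintain

for i, chunk in enumerate(chunks_of_payload_texts):
    if i <= last_completed_chunk:
        continue  # already uploaded before the interruption
    task = text_dataset.create_data_rows(chunk)
    task.wait_till_done()
    last_completed_chunk = i  # persist this index so a retry can skip ahead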
Best practice #4: Upload data asynchronously for faster processing
When ingesting large datasets, asynchronous uploading is a best practice if you don’t need to immediately act on the data.
To upload asynchronously in Python:
- Iterate through payload chunks
- Upload each chunk with 'task = dataset.create_data_rows(chunk)'
- Monitor status with 'task.status'
# create a dataset to hold the text data rows (dataset name is illustrative)
text_dataset = client.create_dataset(name="text_dataset")

# upload texts, chunk by chunk
start = time()
for chunk in chunks_of_payload_texts:
    # upload the chunk asynchronously
    task = text_dataset.create_data_rows(chunk)
    # monitor task status
    print(task.status)
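The image chunks upload the same way; the image_dataset name below is an illustrative placeholder.
# create a dataset to hold the image data rows (dataset name is illustrative)
image_dataset = client.create_dataset(name="image_dataset")

# upload images, chunk by chunk
for chunk in chunks_of_payload_images:
    task = image_dataset.create_data_rows(chunk)
    print(task.status)

print(f"Submitted all chunks in {time() - start:.0f} seconds")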
Using asynchronous uploads saved significant time here: all 10 million examples were ingested in just 25 minutes, with Labelbox efficiently handling the data chunks behind the scenes.
Best practice #5: Wait until data is processed before acting
If you need to immediately act on uploaded data in your script, you can wait for processing to finish before taking further actions in the same Python notebook or script.
You can use 'task.wait_till_done()' after each chunk to pause execution until data is ingested and processed:
# upload texts, chunk by chunk
for chunk in chunks_of_payload_texts:
    # upload the chunk
    task = text_dataset.create_data_rows(chunk)
    # wait for the data to be uploaded and processed
    task.wait_till_done()
    # inspect task status, errors, and results
    print(task.status)
    print(task.errors)
    print(task.result)
    # take action on the newly uploaded data
Now you can take post-upload actions like:
- Create a batch for labeling (see the sketch after this list)
- Apply filters in Catalog
- Send data to Model Foundry for pre-labeling by foundation models
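For example, here is a minimal sketch of sending a slice of the newly uploaded data rows to a project as a batch; the project ID, batch name, and priority are illustrative placeholders.
# send a subset of the uploaded data rows to a labeling project as a batch
project = client.get_project("<YOUR_PROJECT_ID>")  # an existing project

# reference data rows by the global keys assigned during upload
global_keys = [row["global_key"] for row in payload_texts[:1000]]

batch = project.create_batch(
    name="text-upload-batch-1",  # illustrative batch name
    global_keys=global_keys,
    priority=5,  # queue priority
)
print(batch.name)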
A robust data platform with fast, scalable data ingestion is essential for organizations to effectively manage, explore, and leverage large datasets to train high-performing AI models. Labelbox gives teams a platform to unleash the full potential of their data, with ingestion capabilities built to power large-scale AI initiatives.
Resources
- Google Colab Notebook used in this guide: Best practices for data uploads in Labelbox
- Documentation: How to upload data through the UI
- Documentation: How to upload data through the SDK