
A guide to the Data I/O process in Labelbox

As you build intelligent applications, one element remains pivotal: your data. At Labelbox, we recognize the value of your data and its role in driving your operations, which is why we focus on simplifying the Data In and Data Out (I/O) process. This guide takes an in-depth look at the Data I/O process and walks through the steps that streamline your interaction with our platform.

Whether your goal is to store data in a cloud-hosted table, an ML training pipeline, a database, or even a production environment, our aim is to equip you with the knowledge and tools needed for a flexible, robust, and effective data management system.

Understanding Data I/O

At its core, Data I/O refers to the import and export of data in your Labelbox workflow. As simple as it sounds, the process can be quite intricate, given the variety of data formats and storage locations. To manage data effectively, Labelbox uses a structured approach that handles each data type separately: data rows, metadata (including embeddings), attachments, and annotations.

Data In

Data Rows

Data rows typically exist in cloud storage like Amazon Web Services (AWS), Google Cloud Storage (GCS), or Microsoft Azure (Azure). However, data rows can also exist as local files, offering you flexibility in how you access and utilize your data.

Setting Up Data In

For those using cloud storage, setting up Delegated Access is the first step. Delegated Access grants Labelbox secure read access to the unlabeled data hosted in your preferred cloud storage provider, limited to what is necessary to display and label your data. Once access is granted, you need to identify where your metadata (including embeddings) and attachments are stored.

Refer to the Labelbox documentation to learn more about setting up Delegated Access with your cloud storage provider.

Attachments and Metadata

Attachments and metadata traditionally exist in a table (Databricks, Excel, BigQuery, CSV, etc.) alongside or separately from the data rows. To maintain consistency and streamline the process, Labelbox encourages users to upload metadata and attachments directly with the data rows.

1) Metadata is any information known about an asset pre-Labelbox that could be useful in data filtering and selection. Metadata also includes embeddings, which are representations of your data in a vector space. These vectors capture the essential features of your data and represent them in a form that can be processed by machine learning algorithms.

Developer guide on metadata

2) Attachments, on the other hand, can be any additional files or information that supplement your data and assist in the creation of high-quality human labels. Examples include text documents with descriptive data, additional images, audio files, or any other data type that provides more context for your main data row.

Developer guide on attachments

3) Embeddings can improve your data exploration and allow you to make use of similarity search within Catalog. Labelbox computes off-the-shelf embeddings using neural networks trained on publicly available data. Off-the-shelf embeddings provide a useful starting point to explore your data, but to get the most out of similarity search you’ll want to experiment with different embeddings to power your selection based on your particular data.

Developer guide on embeddings
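
To make the idea of similarity search over embeddings concrete, here is a minimal sketch in plain Python. It does not describe how Catalog implements similarity search; it only illustrates ranking assets by cosine similarity between vectors. The keys, vectors, and function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(query_key, embeddings, top_k=2):
    """Rank every other asset by similarity to the query asset."""
    query = embeddings[query_key]
    scored = [
        (key, cosine_similarity(query, vector))
        for key, vector in embeddings.items()
        if key != query_key
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings keyed by a made-up asset identifier.
EMBEDDINGS = {
    "cat-1": [0.9, 0.1, 0.0],
    "cat-2": [0.8, 0.2, 0.1],
    "truck-1": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for key, score in most_similar("cat-1", EMBEDDINGS):
        print(key, round(score, 3))
```

Swapping in embeddings from a different model changes the vectors, and therefore the ranking, which is why experimenting with custom embeddings can change which assets surface together.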

Here are a few pointers:

If your data is in a BigQuery table, you can refer to this connector
If your data is in a Databricks table, check out this connector
For data in a CSV format, use this connector
For unique data formats, our comprehensive Labelbox documentation will be your guide
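
As a rough illustration of uploading metadata and attachments together with the data rows, the sketch below turns rows of a CSV table into one payload per data row. The column names, the sample URLs, and the payload keys (`global_key`, `row_data`, `metadata_fields`, `attachments`) and the `TEXT_URL` attachment type are assumptions chosen for this example; check the developer guides for the exact fields your SDK version expects.

```python
import csv
import io

# Hypothetical table: one row per asset, with a metadata column and an attachment column.
TABLE_CSV = """global_key,row_data,capture_site,notes_url
img-001,https://example.com/img-001.jpg,warehouse-a,https://example.com/img-001.txt
img-002,https://example.com/img-002.jpg,warehouse-b,
"""


def table_to_data_rows(csv_text, metadata_columns, attachment_columns):
    """Build one payload dict per table row, carrying metadata and attachments."""
    data_rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        data_rows.append({
            "global_key": record["global_key"],
            "row_data": record["row_data"],
            # Empty cells are skipped so missing values are not sent as "".
            "metadata_fields": [
                {"name": column, "value": record[column]}
                for column in metadata_columns
                if record[column]
            ],
            # The attachment type string is illustrative, not authoritative.
            "attachments": [
                {"type": "TEXT_URL", "value": record[column]}
                for column in attachment_columns
                if record[column]
            ],
        })
    return data_rows


if __name__ == "__main__":
    for row in table_to_data_rows(TABLE_CSV, ["capture_site"], ["notes_url"]):
        print(row)
```

Keeping the metadata and attachments in the same payload as the data row means a single import step carries everything, rather than a second pass that has to match records back to assets.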

Annotations

As an output from machine learning models, annotations typically exist in JSON files and can be stored either locally or in cloud storage. The beauty of annotations lies in their versatility; they come in various forms including bounding box, mask, radio classification, and others, giving you the freedom to choose what best suits your application.

When it comes to Labelbox, if you want to upload pre-labels or submitted labels (labels made elsewhere), the process involves a few crucial steps:

1) Setting up a Labelbox Project and Ontology: An ontology serves as a blueprint for your labeling project. It defines the labels and the structure for the annotation data you are handling. Setting up an ontology in Labelbox that aligns with your annotation data is an essential step to ensure that your pre-labels or submitted labels can be properly read and processed by the system. You can learn more about how to set up ontologies in our developer guide on ontologies.

2) Understanding the Annotation Data Format: Labelbox supports a wide range of data formats including JSON, CSV, and others. It is important to understand the format of your annotation data to ensure compatibility with the Labelbox platform.

3) Identifying the Annotation Types: The type of annotation used depends on your use case. It could be a bounding box annotation for object detection tasks, a mask annotation for semantic segmentation tasks, or a radio classification for tasks where one option is selected from several choices. By identifying the annotation types that your use case requires, you can ensure that your data is appropriately annotated for your model.

4) Determining the Annotation Format: Annotation formats can either be customer-specific or adhere to industry standards like COCO. Understanding this will help you prepare your data in a way that fits the requirements of Labelbox and aids in efficient data processing.

5) Converting Annotations to Labelbox Format: Labelbox supports two formats for importing annotations: NDJSON and Python annotation types. How these annotations are uploaded depends on your media type; see the Labelbox documentation for details on each.

By paying close attention to these steps, you can maximize the utility of your annotations, thereby boosting the effectiveness of your labeling projects and the performance of your machine learning models.
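
The last two steps can be sketched with a small conversion from COCO-style bounding boxes to NDJSON lines. COCO stores a box as `[x_min, y_min, width, height]`. The NDJSON field names used here (`name`, `dataRow`, `globalKey`, `bbox` with `top`, `left`, `height`, `width`) and the ontology set are assumptions for illustration; confirm the exact schema for your media type in the developer guides before importing.

```python
import json

# Hypothetical ontology: the tool names defined in the project.
ONTOLOGY_TOOL_NAMES = {"car", "pedestrian"}


def coco_box_to_ndjson_row(global_key, class_name, coco_bbox):
    """Convert one COCO box [x_min, y_min, width, height] into an NDJSON-style dict."""
    if class_name not in ONTOLOGY_TOOL_NAMES:
        # An annotation whose name is missing from the ontology cannot be matched.
        raise ValueError(f"'{class_name}' is not in the ontology")
    x_min, y_min, width, height = coco_bbox
    return {
        "name": class_name,
        "dataRow": {"globalKey": global_key},
        "bbox": {"top": y_min, "left": x_min, "height": height, "width": width},
    }


def to_ndjson(rows):
    """Serialize dicts as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    rows = [
        coco_box_to_ndjson_row("img-001", "car", [10, 20, 30, 40]),
        coco_box_to_ndjson_row("img-001", "pedestrian", [5, 5, 12, 30]),
    ]
    print(to_ndjson(rows))
```

Checking names against the ontology before serializing surfaces mismatches early, instead of during the import itself.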

Data Out

In contrast to Data In, the Data Out process focuses on how data is extracted from Labelbox. Extracting data from Labelbox involves exporting labeled data rows, along with their associated metadata, attachments, and annotations. Data exported from Labelbox can be used for a variety of purposes, such as model training or data enrichment. Given that every model has its own input requirements and every organization has its own data storage formats, the export process often varies more from case to case than the import process.

Streamlining Data Out

The Labelbox platform supports a variety of data formats and storage solutions to ensure that your data is exported in a format that suits your needs and is compatible with your storage system. Whether you're using BigQuery, Databricks, CSV, or other formats, Labelbox has a solution for you. Here's how you can navigate the process:

1) Choose an Export Format: Determine the export format that suits your needs and is compatible with your storage system. Labelbox supports a wide variety of formats including JSON, CSV, and others. This gives you the flexibility to choose a format that best aligns with your downstream workflows.

2) Select a Connector: Just as with Data In, Labelbox offers connectors to aid in exporting data. These connectors are designed to bridge the gap between Labelbox and your storage system, making the Data Out process efficient:

For BigQuery users, this connector will come in handy
Databricks users can refer to this connector
For CSV-formatted data, this connector is useful

3) Export your Data: Initiate the export process via your chosen connector. During the export, Labelbox compiles your labeled data rows, associated metadata, attachments, and annotations, and organizes them in your chosen export format.
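
As an illustration of reshaping an export for a tabular destination, the sketch below flattens newline-delimited JSON records into CSV, one line per annotation. The record layout (`global_key`, `row_data`, `annotations`) is a simplified, hypothetical stand-in and not the actual Labelbox export schema; adapt the field access to the structure of your own export.

```python
import csv
import io
import json

# Hypothetical, simplified export: one JSON object per labeled data row.
EXPORT_NDJSON = "\n".join(
    json.dumps(record)
    for record in [
        {
            "global_key": "img-001",
            "row_data": "https://example.com/img-001.jpg",
            "annotations": [{"name": "car"}, {"name": "pedestrian"}],
        },
        {
            "global_key": "img-002",
            "row_data": "https://example.com/img-002.jpg",
            "annotations": [{"name": "car"}],
        },
    ]
)


def export_to_csv(ndjson_text):
    """Flatten exported records into CSV text with one line per annotation."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(["global_key", "row_data", "annotation_name"])
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        for annotation in record.get("annotations", []):
            writer.writerow(
                [record["global_key"], record["row_data"], annotation["name"]]
            )
    return output.getvalue()


if __name__ == "__main__":
    print(export_to_csv(EXPORT_NDJSON))
```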

For a deeper understanding of the process, our documentation provides a wealth of information.

In essence, the Data I/O process is a crucial aspect of your interaction with Labelbox, designed to make your data management effortless. As you become more familiar with the process, you'll find it an essential tool in creating intelligent applications with Labelbox.

Utilizing Your Exported Data

Once your data is exported, it is ready to be utilized for a variety of purposes:

1) Model Training: The primary use case of exported data is to feed it into your machine learning models. The labeled data serves as the training data, guiding your models to recognize patterns and make predictions.

2) Data Analysis & Enrichment: The exported data, especially metadata and annotations, can provide valuable insights when analyzed, and those insights can guide decision-making processes and strategies within your organization. The exported data can also enrich your existing data sets, enhancing their accuracy and detail and leading to more effective models and analytics.

3) Iterative Improvements: In some cases, exported data can also be fed back into your annotation process for iterative improvements, creating a feedback loop that continually enhances your data quality and model performance.
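
When exported data feeds model training across several export rounds, it helps if each asset always lands in the same split. The sketch below shows one common way to do that, by hashing an identifier; it is a general technique rather than a Labelbox feature, and the `global_key` field and function names are assumptions for this example.

```python
import hashlib


def assign_split(global_key, validation_percent=20):
    """Map an identifier to 'train' or 'validation' deterministically."""
    digest = hashlib.sha256(global_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "validation" if bucket < validation_percent else "train"


def split_records(records, validation_percent=20):
    """Group exported records by the split their identifier maps to."""
    splits = {"train": [], "validation": []}
    for record in records:
        split = assign_split(record["global_key"], validation_percent)
        splits[split].append(record)
    return splits


if __name__ == "__main__":
    # Hypothetical exported records; only the identifier matters here.
    records = [{"global_key": f"img-{i:03d}"} for i in range(50)]
    splits = split_records(records)
    print(len(splits["train"]), "train /", len(splits["validation"]), "validation")
```

Because the assignment depends only on the identifier, re-exporting after another labeling round does not move previously seen assets between the training and validation sets.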

In conclusion, the Data I/O process in Labelbox is an integral component of successful application creation. This guide provides an in-depth understanding of the process, allowing you to accurately import and export data, whether that be data rows, metadata, attachments, or annotations. The process ensures compatibility with various data formats and storage systems, while facilitating efficient data management for your machine learning projects.

For detailed instructions or further understanding, our comprehensive documentation is always available.
