How to automatically ingest data from Databricks into Labelbox

Moving data seamlessly through your MLOps pipeline is essential to building successful AI products. In this guide, learn how you can leverage Labelbox’s Databricks pipeline creator to automatically ingest data from your Databricks domain into Labelbox for data exploration, curation, labeling, and much more.

1. Navigate to the pipeline creator webpage and enter your Databricks domain. You can find it in the URL whenever you access your Databricks environment; simply copy and paste it into the pipeline creator.

Your Databricks domain is in the URL whenever you access your Databricks environment.
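If you're scripting around this step, the domain is simply the host portion of your workspace URL. A minimal Python sketch of pulling it out (the example URL below is a hypothetical placeholder; yours will differ):

```python
from urllib.parse import urlparse

def databricks_domain(workspace_url: str) -> str:
    """Return the Databricks domain (the host) from a full workspace URL."""
    return urlparse(workspace_url).netloc

# Hypothetical example workspace URL:
print(databricks_domain("https://adb-1234567890123456.7.azuredatabricks.net/?o=123"))
# prints: adb-1234567890123456.7.azuredatabricks.net
```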

2. Now select the cloud environment that your Databricks workspace runs in. You can usually infer this from the Databricks domain you just pasted (for example, Azure workspace domains end in azuredatabricks.net).

3. If you don’t already have a Databricks API token, create one from your Databricks workspace: open the user menu in the top right, go to User Settings > Developer > Access tokens, and generate a new access token. Paste the token into the pipeline creator.

You can create an access token for the Databricks API by going to the user tab, then to User Settings > Developer > Access tokens.
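Beyond the pipeline creator, this same token authenticates any call to the Databricks REST API as a Bearer token. A minimal sketch using only the Python standard library (the host and token values are placeholders you'd replace with your own):

```python
import urllib.request

DATABRICKS_HOST = "https://your-workspace.cloud.databricks.com"  # placeholder
DATABRICKS_TOKEN = "dapi-your-token-here"  # placeholder personal access token

def databricks_request(path: str) -> urllib.request.Request:
    """Build an authenticated request against the Databricks REST API."""
    return urllib.request.Request(
        DATABRICKS_HOST + path,
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    )

# For example, listing clusters (not executed here; it needs a live workspace):
# urllib.request.urlopen(databricks_request("/api/2.0/clusters/list"))
```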

4. Next, you’ll need your Labelbox API key. If you don’t already have one, you can create one from within your Labelbox environment by going to Workspace Settings > API > Create a new key. Paste this key into the pipeline creator.

Create an API key in Labelbox from the Workspace Settings section of your environment.

5. Next, the pipeline creator will give you the option of making a new dataset or appending to an existing one. Be sure to give the dataset a relevant name, as it will appear under that name within Labelbox Catalog once the dataset has been created and the data ingested from Databricks.

Name your dataset within the pipeline creator webpage.
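Under the hood, creating a named dataset corresponds to a single call in the Labelbox Python SDK. A hedged sketch, assuming the labelbox package is installed (`pip install labelbox`); the function name and the example key/dataset names are placeholders:

```python
def create_named_dataset(api_key: str, dataset_name: str):
    """Create a new, empty Labelbox dataset with the given name.

    Requires the Labelbox SDK: pip install labelbox
    """
    import labelbox as lb  # imported lazily so this sketch stays self-contained

    client = lb.Client(api_key=api_key)
    return client.create_dataset(name=dataset_name)

# Example usage (needs a real API key, so not executed here):
# dataset = create_named_dataset(LB_API_KEY, "databricks-product-images")
```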

6. On the pipeline creator page, select a cluster from your Databricks environment.

7. Once the cluster is ready, you’ll see the option to select a run frequency: the cadence at which the workflow will execute.

Choose how often you want this workflow to run.
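Behind the scenes, Databricks workflows express a run cadence as a Quartz cron schedule (the `schedule` block in the Databricks Jobs API). As an illustration only, here is what a daily 2:00 AM UTC cadence would look like; the specific values are hypothetical:

```python
# Illustrative Databricks Jobs API schedule block: run daily at 02:00 UTC.
# Quartz cron fields: second minute hour day-of-month month day-of-week
schedule = {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}
```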

8. Next, choose the relevant table and database in Databricks from which you want to pull data.

Choose the database and table you want to pull data from.

9. The page will show you a sample of data rows within the selected table. The row data column contains the URL from which the data will be pulled. This can be either a public URL or an object hosted within your cloud storage.

A sample of the data rows in the selected dataset will appear on the pipeline creator page.

10. Choose the row data column pointing to the object that you want to render. You can also optionally choose a global key column, which specifies the unique identifier assigned to each data row once it’s ingested into Labelbox.

Select your row data column and, if needed, a global key column.
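These two choices map directly onto the payload shape the Labelbox SDK expects for `Dataset.create_data_rows`: each record needs a `row_data` URL and, optionally, a `global_key`. A sketch of that mapping, with hypothetical column names and sample records:

```python
def to_data_row_payloads(records, row_data_col, global_key_col=None):
    """Convert table records into Labelbox data row payloads."""
    payloads = []
    for rec in records:
        payload = {"row_data": rec[row_data_col]}
        if global_key_col is not None:
            payload["global_key"] = rec[global_key_col]
        payloads.append(payload)
    return payloads

# Hypothetical sample records pulled from the selected Databricks table:
records = [
    {"image_url": "https://example.com/cat.jpg", "id": "img-001"},
    {"image_url": "https://example.com/dog.jpg", "id": "img-002"},
]
payloads = to_data_row_payloads(records, row_data_col="image_url", global_key_col="id")
# payloads[0] == {"row_data": "https://example.com/cat.jpg", "global_key": "img-001"}

# These payloads could then be uploaded with the Labelbox SDK, e.g.:
# dataset.create_data_rows(payloads)
```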

11. Click Deploy pipeline. After a few seconds, you’ll see a confirmation that your pipeline has been deployed. Now you can navigate to your Databricks environment, go to the Workflows tab on the left, and see the new upload workflow that you’ve just created. Once the workflow has run, you’ll be able to see the ingested data in Labelbox Catalog.

See the new upload workflow within your Databricks environment by navigating to the Workflows tab.

Your new workflow will now ingest the specified data into Labelbox at the cadence you chose. Read this blog post to learn more about how you can integrate Databricks and Labelbox into a seamless data engine for AI.