How John Deere's data engine automates data curation and labeling from 1B+ assets


Blue River Technology needed to rapidly scale and optimize their computer vision model development pipeline and cut their iteration cycles, which often took several weeks, down to hours in order to deliver the best AI-powered products. Two of the primary causes of delay were data management and infrastructure that ML engineers had to build and maintain themselves, and an arduous data curation process that grew longer and more painful as the amount of data increased exponentially.


The team built a unified machine learning and data engine that leverages embedded integrations with best-in-class data storage and management, data curation, and labeling solutions. The platform also includes multiple robust and innovative applications designed to increase efficiencies and reduce ML engineering workloads.


With the new data engine, Blue River Technology’s ML teams can now spend more of their time focusing on training, monitoring, and maintaining their computer vision models. Their data scientists can pull updated, refined, relevant datasets for every use case and model within minutes via Labelbox Catalog.

Note: This article is a short recap of a talk from Blue River Technology at the Databricks Data + AI Summit in June 2023.

Blue River Technology is an independently run subsidiary of John Deere that creates transformative computer vision and machine learning solutions for a variety of farming, forestry, and construction machinery. Over the past several years, their See & Spray technology, which uses computer vision to recognize weeds among crops, has won recognition and acclaim for helping farmers spray only the plants that need it. This significantly reduces what farmers spend on pesticides and weed killers and vastly reduces the amount of these chemicals released into the environment.

As the team built, monitored, and scaled the models that power their innovative solutions, one specific challenge emerged: the model training lifecycle took weeks to complete. As their technology was used on more machines, the amount of data collected grew exponentially. Ingesting and acting on petabytes of data quickly became imperative to scaling up, and it was a challenge for their traditional data management and MLOps pipeline. Some areas of inefficiency in their setup included:

  1. Data curation: Finding the right data for a particular use case and model became harder as the amount of data increased, and grew into a long and painful manual process.

  2. Data labeling: Labeling data by hand was a long and expensive process that significantly delayed the ML iteration cycle.

  3. Collaboration: Data scientists, ML engineers, and other team members had to spend far too much time managing data and figuring out data infrastructure, rather than focusing on what they do best: training, deploying, and maintaining models.

To address these issues, the Blue River Technology team built a new and improved version of their original computer vision system. That original system was simpler: it sat on a traditional data lake with curated data stores, and the entire ML system ran on Kubeflow. The new system, a fully integrated, unified data and ML platform, was designed to increase efficiency, enable growth, and afford its end users (ML engineers, data scientists, and others building CVML models) more time and independence to do their work without building or maintaining any ad hoc infrastructure.

The new system combines a robust data foundation, an ML system that uses both Kubeflow and the latest Databricks capabilities to give engineers more freedom and flexibility, and embedded integrations with best-in-class tools for data curation, automated data labeling, and more.

The team had already implemented Labelbox’s model-assisted labeling workflow, which cut human labeling time and costs in half. To augment this process and ensure that it produced the highest quality training data for their models, they also built an automated model-assisted quality control pipeline. It finds and logs discrepancies between model-generated labels and human labels, then displays them on a smart audit dashboard that the team can easily monitor.
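The talk recap does not describe how the quality control pipeline compares labels, so the following is only a minimal sketch of one common approach, assuming bounding-box labels: match each model box to a human box by intersection over union (IoU) and log anything that fails to match or disagrees on class. The `Box` schema, the `find_discrepancies` function, the 0.5 threshold, and the discrepancy kinds are all hypothetical names chosen for illustration, not Blue River Technology's or Labelbox's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box with a class label (hypothetical schema)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    return inter / (area_a + area_b - inter)


def find_discrepancies(
    model_boxes: List[Box],
    human_boxes: List[Box],
    iou_threshold: float = 0.5,  # assumed cutoff, tune per use case
) -> List[Dict[str, Optional[object]]]:
    """Greedily match model boxes to human boxes and return mismatch records."""
    discrepancies: List[Dict[str, Optional[object]]] = []
    unmatched_human = list(human_boxes)

    for m in model_boxes:
        # Pick the best-overlapping human box that has not been matched yet.
        best = max(unmatched_human, key=lambda h: iou(m, h), default=None)
        if best is None or iou(m, best) < iou_threshold:
            discrepancies.append({"kind": "model_only", "model": m, "human": None})
            continue
        unmatched_human.remove(best)
        if best.label != m.label:
            discrepancies.append({"kind": "label_mismatch", "model": m, "human": best})

    # Human boxes the model never proposed.
    for h in unmatched_human:
        discrepancies.append({"kind": "human_only", "model": None, "human": h})
    return discrepancies
```

In a setup like this, the returned records would be written to a table that an audit dashboard reads from. Greedy matching is the simplest choice here; a production pipeline might prefer an optimal assignment when boxes are densely packed.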

With Labelbox Catalog as their integrated data curation solution, the team can easily leverage advanced techniques such as similarity search, natural language search, metadata augmentation, and more to “find the needle in the haystack.” They are now also able to create ultra-refined datasets for each use case and model, and because the conditions and rules for each of these datasets are applied to all incoming data, new images that fall into these categories are automatically added to the datasets. As a result, ML teams within Blue River Technology can now access updated, curated datasets that match their requirements within minutes, even though the organization at large works with over a billion images.
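The idea of datasets whose conditions are applied to all incoming data can be illustrated with a small sketch. This is not the Labelbox Catalog API. It is a hypothetical, in-memory model in which each dataset carries a rule (a predicate over image metadata) and every newly ingested image is checked against all rules. The names `SavedDataset` and `ingest`, and the metadata keys `crop` and `lux`, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metadata = Dict[str, object]


@dataclass
class SavedDataset:
    """A named dataset defined by a rule rather than a fixed list of images."""
    name: str
    rule: Callable[[Metadata], bool]  # returns True if an image belongs here
    image_ids: List[str] = field(default_factory=list)


def ingest(image_id: str, metadata: Metadata, datasets: List[SavedDataset]) -> List[str]:
    """Add a new image to every dataset whose rule it satisfies.

    Returns the names of the datasets that accepted the image.
    """
    matched = []
    for ds in datasets:
        if ds.rule(metadata):
            ds.image_ids.append(image_id)
            matched.append(ds.name)
    return matched


# Example rules (hypothetical metadata keys).
datasets = [
    SavedDataset("corn_low_light", lambda m: m.get("crop") == "corn" and m.get("lux", 0) < 500),
    SavedDataset("all_soy", lambda m: m.get("crop") == "soy"),
]

ingest("img_001", {"crop": "corn", "lux": 120}, datasets)
ingest("img_002", {"crop": "soy", "lux": 900}, datasets)
ingest("img_003", {"crop": "corn", "lux": 2000}, datasets)
```

Because membership is decided at ingestion time, each dataset stays current without anyone re-running a manual search. At the scale of a billion images, the rules would be evaluated by an indexed query engine rather than a Python loop.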

Moving forward, the team plans to use their new integrated ML platform to build more innovative solutions and to keep testing and building new ways to increase the platform's efficiency and scalability. To learn more about Blue River Technology’s data and machine learning platform, their tech stack, and innovations in MLOps infrastructure, watch the team’s presentation from the Data + AI Summit 2023.