Curating training signal from 1B+ images for farm robotics

Problem

Blue River Technology needed to scale its computer vision model development and shrink iteration cycles from weeks to hours. Two things caused most of the delay: ML engineers were building and maintaining data infrastructure themselves, and data curation grew more painful as data volume increased exponentially.

Solution

Blue River built a unified machine learning and data engine with embedded integrations for data storage, curation, and labeling. Labelbox produces and curates the training signal through model-assisted labeling, automated quality control, and Catalog for finding the right data, so ML engineers spend less time on infrastructure.

Result

With the new data engine, Blue River's ML teams spend more time training, monitoring, and maintaining their computer vision models. Data scientists pull updated, refined, relevant datasets for every use case and model within minutes via Labelbox Catalog.

Blue River's See & Spray uses computer vision to tell weeds from crops. Labelbox's platform produces the training signal and curates the right data from over a billion images in minutes.

The challenge

Blue River Technology, an independently run John Deere subsidiary, builds computer vision and machine learning for farming, forestry, and construction machinery. Its See & Spray technology uses computer vision to recognize weeds among crops, so machines spray only the plants that need it — cutting pesticide costs and the chemicals released into the environment. As more machines ran the technology, the data grew exponentially into petabytes, and the model training lifecycle took weeks. The traditional data and MLOps pipeline couldn't keep up:

  1. Data curation: finding the right data for a use case and model got harder as data grew — a long, manual process.

  2. Data labeling: labeling by hand was slow and expensive, delaying the ML iteration cycle.

  3. Collaboration: data scientists and ML engineers spent too much time managing data and infrastructure instead of training, deploying, and maintaining models.

The approach

Blue River built a unified data and ML platform on Kubeflow and Databricks, with embedded integrations for data curation and automated labeling — so engineers could build CVML models without standing up ad hoc infrastructure.

Labelbox produces the training signal. The team had already implemented Labelbox's model-assisted labeling, which cut human labeling time and costs in half. To keep signal quality high, it built an automated, model-assisted quality control pipeline that finds and logs discrepancies between model-generated and human labels, then surfaces them on a smart audit dashboard.

For curation, the team uses Labelbox Catalog's similarity search, natural language search, and metadata augmentation to find rare, hard-to-locate examples. It builds narrowly scoped datasets for each use case, with rules that automatically add matching new images. ML teams pull updated, curated datasets in minutes, even across more than a billion images.
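To make the quality control idea concrete, here is a minimal sketch of how discrepancies between model-generated and human labels might be detected for one image. It is not Blue River's pipeline and does not use the Labelbox API. The `Box` type, the `find_discrepancies` function, the bounding-box label format, the greedy matching strategy, and the 0.5 intersection-over-union (IoU) threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box with a class label."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def find_discrepancies(model_boxes, human_boxes, iou_threshold=0.5):
    """Greedily match model boxes to human boxes and record disagreements.

    Returns a list of dicts, one per discrepancy, suitable for logging.
    The threshold value is an assumption, not a documented setting.
    """
    discrepancies = []
    unmatched_human = list(human_boxes)
    for m in model_boxes:
        # Pick the remaining human box that overlaps this model box the most.
        best: Optional[Box] = max(
            unmatched_human, key=lambda h: iou(m, h), default=None
        )
        if best is None or iou(m, best) < iou_threshold:
            # The model found something no annotator marked.
            discrepancies.append({"kind": "model_only", "model": m, "human": None})
            continue
        unmatched_human.remove(best)
        if best.label != m.label:
            # Same region, different class (for example weed versus crop).
            discrepancies.append({"kind": "label_mismatch", "model": m, "human": best})
    for h in unmatched_human:
        # An annotator marked something the model missed.
        discrepancies.append({"kind": "human_only", "model": None, "human": h})
    return discrepancies


model = [Box("weed", 0, 0, 10, 10), Box("crop", 20, 20, 30, 30), Box("weed", 50, 50, 60, 60)]
human = [Box("weed", 1, 1, 10, 10), Box("weed", 20, 20, 30, 30), Box("crop", 80, 80, 90, 90)]
for record in find_discrepancies(model, human):
    print(record["kind"], record["model"], record["human"])
```

In a setup like this, each record could be written to a log table that an audit dashboard reads from. Greedy matching keeps the sketch short; a production pipeline would more likely use an optimal assignment and compare segmentation masks where boxes are too coarse.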

The outcome

Blue River's ML teams now spend their time training, monitoring, and maintaining computer vision models instead of wrangling data. Data scientists pull refined, relevant datasets for any use case or model within minutes via Labelbox Catalog, against a corpus of over a billion images.

Where this goes

Agriculture is embodied intelligence in the field. The same pattern powering frontier robotics applies here: a data engine that turns raw sensor data into curated, expert-graded training signal — and models that improve as fast as the data arrives.