How Blue River Technology's data engine automates data curation and processing from 1B+ assets

Blue River's See & Spray uses computer vision to tell weeds from crops. Labelbox's platform produces the training signal and curates the right data from over a billion images in minutes.

The challenge

Blue River Technology, an independently run John Deere subsidiary, builds computer vision and machine learning for farming, forestry, and construction machinery. Its See & Spray technology uses computer vision to recognize weeds among crops, so machines spray only the weeds, which cuts pesticide costs and the volume of chemicals released into the environment. As more machines ran the technology, the data grew exponentially into petabytes, and the model training lifecycle took weeks. The traditional data and MLOps pipeline couldn't keep up:

Data curation: finding the right data for a given use case and model was a long, manual process that got harder as the data grew.
Data processing: processing pipelines were slow and expensive, delaying the ML iteration cycle.
Collaboration: data scientists and ML engineers spent too much time managing data and infrastructure instead of training, deploying, and maintaining models.

The approach

Blue River built a unified data and ML platform on Kubeflow and Databricks, with embedded integrations for data curation and automated workflows — so engineers could build CVML models without standing up ad hoc infrastructure.

Labelbox produces the training signal. The team had already implemented Labelbox's model-assisted feedback workflows, which cut data processing time and costs in half. To keep signal quality high, it built an automated, model-assisted quality control pipeline that finds and logs discrepancies between model-generated outputs and expert human feedback, then surfaces them on an audit dashboard. For curation, the team uses Labelbox Catalog's similarity search, natural language search, and metadata augmentation to find rare, relevant images, and builds narrowly scoped datasets per use case, with rules that automatically add matching new images. ML teams pull updated, curated datasets in minutes, even from a corpus of over a billion images.
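To make the quality control idea concrete, here is a minimal sketch of how discrepancies between model output and human feedback might be detected for a single image. It is not Blue River's pipeline or the Labelbox API: the `Box` type, the `find_discrepancies` function, the bounding-box representation, and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box with a class label."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def find_discrepancies(model_boxes, human_boxes, iou_threshold=0.5):
    """Greedily match each human box to the best-overlapping model box
    and record every disagreement as a dict that could be logged."""
    discrepancies = []
    unmatched_model = list(model_boxes)
    for human in human_boxes:
        best = max(unmatched_model, key=lambda m: iou(m, human), default=None)
        if best is None or iou(best, human) < iou_threshold:
            # The expert marked something the model did not find.
            discrepancies.append({"kind": "missed_by_model", "human": human})
            continue
        unmatched_model.remove(best)
        if best.label != human.label:
            # Same region, different class (for example weed vs. crop).
            discrepancies.append(
                {"kind": "label_mismatch", "human": human, "model": best}
            )
    for model in unmatched_model:
        # The model predicted something the expert did not mark.
        discrepancies.append({"kind": "extra_model_box", "model": model})
    return discrepancies


human = [Box("weed", 0, 0, 10, 10), Box("crop", 20, 20, 30, 30)]
model = [Box("crop", 0, 0, 10, 10), Box("weed", 50, 50, 60, 60)]
for d in find_discrepancies(model, human):
    print(d["kind"])
```

Records like these could feed an audit view grouped by discrepancy kind. The greedy matching keeps the sketch short; a production system would more likely use an optimal assignment and per-class thresholds.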

The outcome

Blue River's ML teams now spend their time training, monitoring, and maintaining computer vision models instead of wrangling data. Data scientists pull refined, relevant datasets for any use case or model within minutes via Labelbox Catalog, against a corpus of over a billion images.
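The rule-driven datasets mentioned in the approach can be pictured as saved filters that are evaluated whenever a new image arrives. The sketch below is plain Python, not the Labelbox Catalog API; `RuleBasedDataset`, `metadata_rule`, and the metadata keys (`crop`, `lighting`) are hypothetical names used only to show the pattern.

```python
from dataclasses import dataclass, field
from typing import Callable


def metadata_rule(**required) -> Callable[[dict], bool]:
    """Build a rule that matches assets whose metadata has all given values."""
    def rule(asset: dict) -> bool:
        meta = asset.get("metadata", {})
        return all(meta.get(key) == value for key, value in required.items())
    return rule


@dataclass
class RuleBasedDataset:
    """Hypothetical curated dataset that grows as matching assets arrive."""
    name: str
    rule: Callable[[dict], bool]
    asset_ids: list = field(default_factory=list)

    def ingest(self, asset: dict) -> bool:
        """Add the asset if it satisfies the rule; report whether it was added."""
        if self.rule(asset):
            self.asset_ids.append(asset["id"])
            return True
        return False


dataset = RuleBasedDataset("cotton-dusk", metadata_rule(crop="cotton", lighting="dusk"))
incoming = [
    {"id": "a1", "metadata": {"crop": "cotton", "lighting": "dusk"}},
    {"id": "a2", "metadata": {"crop": "soy", "lighting": "dusk"}},
    {"id": "a3", "metadata": {"crop": "cotton", "lighting": "dusk", "field": "north"}},
]
for asset in incoming:
    dataset.ingest(asset)
print(dataset.asset_ids)
```

Because the rule runs at ingest time, pulling the dataset later is a lookup rather than a scan of the full corpus, which is one plausible reason such pulls can stay fast as the corpus grows.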

Where this goes

Agriculture is embodied intelligence in the field. The same pattern powering frontier robotics applies here: a data engine that turns raw sensor data into curated, expert-graded training signal — and models that improve as fast as the data arrives.
