Tracking surgical instruments in video to advance robotic surgery
Problem
Intuitive Surgical's data science team needed to detect and track surgical instruments across tens of thousands of surgical videos. Producing that volume of accurate, frame-by-frame spatial signal — under a consistent ontology shared across clinical and data science teams — was the bottleneck to training the models.
Solution
Intuitive Surgical used Labelbox to produce the spatial training signal its models needed. The native video editor captured frame-by-frame instrument detection and tracking, and model-assisted labeling refined pre-labels into corrected, expert-graded signal. A shared ontology aligned clinical and data science teams so every piece of signal was consistent, with detailed metrics on signal velocity and quality across model versions.
Result
The team now produces richer spatial signal for its video projects. It has scaled signal throughput, lowered the overhead of gathering performance and quality metrics, and doubled the speed at which it delivers training signal for its multiple ML models.

Intuitive Surgical builds computer vision models that detect and track instruments in surgical video. Labelbox's platform produced the spatial training signal, under a consistent ontology, and doubled the speed of delivering it.
The challenge
Intuitive Surgical pioneered robotic-assisted surgery with the da Vinci Surgical System. Its data science team builds machine learning models that work behind the scenes: assessing surgical performance, identifying skilled tool use and choreography, and planning operating-room resources. One capability mattered most — automatically detecting and tracking surgical instruments in video. The robotic systems generate a rich set of data, but training those models meant annotating bounding boxes frame-by-frame across tens of thousands of surgical videos, covering a wide variety of tools and procedures. Producing that volume of accurate spatial signal, under a consistent ontology across clinical and data science teams, was the bottleneck.
The approach
Intuitive Surgical used Labelbox to produce the structured training signal its models needed. The platform's native video editor captured frame-by-frame instrument detection and tracking, and model-assisted labeling refined pre-labels into corrected, expert-graded signal — cutting the work to produce each batch. The team also encoded informative data like timestamps of instrument installation and removal. Labelbox aligned clinical and data science teams on a shared ontology so every piece of signal was consistent and meaningful to the models, and gave them detailed metrics on signal velocity and quality across model versions.
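As a rough illustration of the ideas in this section, the sketch below shows one way a shared ontology, frame-by-frame boxes, and pre-label correction could fit together in plain Python. It is not the Labelbox SDK and not Intuitive Surgical's pipeline. The instrument names, the `Box` type, and the helper functions are all hypothetical. Intersection-over-union between a model pre-label and the expert-corrected box is used here as an assumed stand-in for a quality metric, and the first and last labeled frames are an assumed proxy for installation and removal times.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical shared ontology: class names are illustrative only.
ONTOLOGY = frozenset({"needle_driver", "forceps", "scissors"})


@dataclass(frozen=True)
class Box:
    """One bounding box for one instrument on one video frame."""
    frame: int
    tool: str
    x: float  # left edge, pixels
    y: float  # top edge, pixels
    w: float  # width, pixels
    h: float  # height, pixels


def validate(box: Box, ontology: frozenset = ONTOLOGY) -> None:
    """Reject labels that fall outside the agreed ontology or are degenerate."""
    if box.tool not in ontology:
        raise ValueError(f"unknown tool class: {box.tool!r}")
    if box.w <= 0 or box.h <= 0:
        raise ValueError("box must have positive width and height")


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes (0.0 when they do not overlap)."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def presence_interval(boxes: Iterable[Box], tool: str) -> Optional[Tuple[int, int]]:
    """First and last frame on which a tool is labeled, or None if never seen."""
    frames = [b.frame for b in boxes if b.tool == tool]
    return (min(frames), max(frames)) if frames else None


# A model pre-label and the box an expert corrected it to, on the same frame.
pre_label = Box(frame=0, tool="forceps", x=0, y=0, w=10, h=10)
corrected = Box(frame=0, tool="forceps", x=5, y=0, w=10, h=10)
validate(corrected)
agreement = iou(pre_label, corrected)
```

Tracking a score like `agreement` per model version would give one coarse view of how much correction each batch of pre-labels needs, which is the kind of quality signal the section describes.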
The outcome
Intuitive Surgical now produces richer spatial signal for its video projects, locating objects of interest in the camera feed for classification. The team scaled signal throughput, lowered the overhead of gathering performance and quality metrics, and doubled the speed at which they deliver training signal for their multiple ML models.
The key question we are tackling is: how do we make surgery a better experience? The goal is to achieve a more efficient annotation pipeline so that, given all this rich data that we are collecting, we can provide insights with actionable and trusted feedback that help surgeons improve their performance. We rely on collaborative software to help align our different teams, such as our clinical teams and data science teams, to ensure that we have a clearly defined ontology. This ensures that all labeling activities are consistent and provide meaningful value to our models.
— Xi Liu, Manager of ML and Data Science, Intuitive Surgical
Where this goes
Surgical video is a grounding source for embodied intelligence. The pattern is the one frontier teams use: expert-graded signal under a tight ontology, feeding models that improve each iteration.