Working with many leading AI companies, we're seeing a surge of enthusiasm for using advanced models to label data first and bring in human expertise afterward to refine and tailor the results, turning around tasks that were previously labor-intensive and time-consuming.
These AI models are transforming one of the most daunting tasks in machine learning: creating high-quality video datasets. With them, machine learning teams can pre-label or enrich data automatically, supporting applications that range from monitoring driver behavior to detecting objects in manufacturing environments.
This blog post will explore how models like Gemini 1.5 Pro, Grounding DINO, and SAM are redefining the video labeling landscape and boosting efficiency and speed.
By automating the labor-intensive labeling tasks, these models not only accelerate the workflow but also free up time for users and reduce labeling costs.
Workflow of selecting the video of interest
Workflow of selecting the model of choice
Once the model of interest is selected, users can click on the model to view and configure its settings, ontology, and prompt.
While this step is optional, generating preview predictions allows users to confirm the configuration settings confidently:
Workflow of configuring the batch and clicking “Submit”
Users can transfer the results to a labeling project with the "Send to Annotate" feature in the UI. Labelers can then quickly review the labels for accuracy.
Segmentation masks are used for autonomous vehicles, medical imagery, retail applications, face recognition and analysis, video surveillance, satellite image analysis, etc. Masks are some of the most time-consuming annotations to make for video. Below, we see an example of how this can be automated with Grounding DINO + SAM so the reviewers can make small edits if needed instead of starting from scratch.
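To make the automation concrete, here is a minimal sketch of the underlying technique on a single extracted frame, using the Hugging Face transformers implementations of Grounding DINO and SAM. The checkpoints, the frame path, and the text prompt "a person." are illustrative choices for the example; Foundry runs this kind of pipeline for you, and the sketch only shows how a text prompt becomes boxes and then masks.

```python
# Minimal sketch: text-prompted masks for one video frame with Grounding DINO + SAM
# (Hugging Face transformers implementations; checkpoints, frame path, and prompt are illustrative).
import torch
from PIL import Image
from transformers import (
    AutoProcessor,
    GroundingDinoForObjectDetection,
    SamModel,
    SamProcessor,
)

frame = Image.open("frame_0001.jpg")  # one frame extracted from the video

# 1) Grounding DINO: detect boxes from a free-text prompt.
dino_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
dino = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")
inputs = dino_processor(images=frame, text="a person.", return_tensors="pt")
with torch.no_grad():
    outputs = dino(**inputs)
detections = dino_processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[frame.size[::-1]],
)[0]

# 2) SAM: turn each detected box into a pixel-level mask.
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam = SamModel.from_pretrained("facebook/sam-vit-base")
boxes = [[box.tolist() for box in detections["boxes"]]]  # one list of boxes per image
sam_inputs = sam_processor(frame, input_boxes=boxes, return_tensors="pt")
with torch.no_grad():
    sam_outputs = sam(**sam_inputs)
masks = sam_processor.image_processor.post_process_masks(
    sam_outputs.pred_masks, sam_inputs["original_sizes"], sam_inputs["reshaped_input_sizes"]
)[0]  # binary masks for each detected box, ready for reviewer touch-ups
```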
Bounding boxes are used in scenarios similar to segmentation masks, but where pixel-level precision is not required. They can be automated using Grounding DINO, as illustrated below with the detection of a person in a video.
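Below is a similar sketch that samples frames from a video with OpenCV and collects Grounding DINO boxes per frame. The video path, sampling rate, and prompt are again illustrative rather than a description of how Foundry is implemented.

```python
# Minimal sketch: run Grounding DINO over sampled video frames to pre-label
# "person" bounding boxes (video path, sampling rate, and prompt are illustrative).
import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

dino_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
dino = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")

cap = cv2.VideoCapture("driver_cam.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30
pre_labels = []  # one entry per sampled frame: (frame_index, list of [x0, y0, x1, y1] boxes)

frame_idx = 0
while True:
    ok, bgr = cap.read()
    if not ok:
        break
    if frame_idx % int(fps) == 0:  # sample roughly one frame per second
        frame = Image.fromarray(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
        inputs = dino_processor(images=frame, text="a person.", return_tensors="pt")
        with torch.no_grad():
            outputs = dino(**inputs)
        result = dino_processor.post_process_grounded_object_detection(
            outputs, inputs.input_ids,
            box_threshold=0.35, text_threshold=0.25,
            target_sizes=[frame.size[::-1]],
        )[0]
        pre_labels.append((frame_idx, [box.tolist() for box in result["boxes"]]))
    frame_idx += 1
cap.release()
# pre_labels can now be converted into bounding-box annotations for human review.
```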
Global classification is used when an overall classification of the entire video is required, such as when a driver safety system needs to detect whether a driver is distracted. Gemini 1.5 Pro can analyze an hour-long video and answer questions about events that took place in it. This automation reduces the need for human intervention, allowing personnel to focus on reviewing only the videos flagged with specific classifications, as shown below.
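As an illustration of the technique, here is a minimal sketch using the google-generativeai Python SDK and its File API to ask Gemini 1.5 Pro for a single video-level label. The file name, prompt wording, and label set are assumptions for the example.

```python
# Minimal sketch: global video classification with Gemini 1.5 Pro via the
# google-generativeai SDK (file name, prompt, and label set are illustrative).
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the video through the File API and wait until it has been processed.
video = genai.upload_file(path="driver_cam.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
prompt = (
    "Watch this driver-facing video and answer with exactly one label: "
    "'distracted' or 'attentive'. Base your answer on the entire video."
)
response = model.generate_content([video, prompt])
print(response.text)  # e.g. "distracted" -> route the video to human review
```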
Frame-based classification is used in similar scenarios, but when the classification needs to be tied to specific frames or timestamps rather than the video as a whole. Gemini 1.5 Pro can analyze an hour-long video and identify the timestamps at which a particular event occurs. Below is an example that checks whether the driver is distracted on each frame.
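A small variation of the same sketch asks for timestamped events instead of a single label. It continues from the snippet above, reusing the uploaded video file and the genai import, and the prompt and JSON schema are illustrative.

```python
# Minimal sketch: timestamped (frame-level) classification with Gemini 1.5 Pro,
# reusing the `video` file uploaded in the previous snippet
# (prompt wording and JSON schema are illustrative).
import json
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")
prompt = (
    "Review this driver-facing video. For every interval where the driver is "
    "distracted, return a JSON list of objects with 'start', 'end' (mm:ss), and "
    "'reason'. Return [] if the driver is never distracted."
)
response = model.generate_content(
    [video, prompt],
    generation_config={"response_mime_type": "application/json"},
)
events = json.loads(response.text)
# e.g. [{"start": "01:20", "end": "01:35", "reason": "looking at phone"}]
# Each interval can be mapped back to frame ranges (timestamp * fps) and
# attached to the corresponding frames as pre-labels for review.
```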
Additional considerations as users incorporate Foundry labels into their projects and workflows:
Annotating video data has traditionally been a tedious and time-consuming task. Integrating advanced AI models from Labelbox Foundry into the video labeling process marks a significant shift in how video data is annotated. By leveraging Foundry's capabilities, users can drastically speed up their video labeling projects, shortening the time required to bring products to market and substantially reducing the cost of model development.
Check out our additional resources on how to utilize state-of-the-art AI models in Foundry, including using model distillation and fine-tuning to leverage the power of foundation models: