Labelbox · July 6, 2023

Recapping CVPR 2023: Key takeaways from Labelbox

The research shared at the Conference on Computer Vision and Pattern Recognition (CVPR) is often the first indicator of new and transformative leaps in the rapid evolution of computer vision and AI. This year in Vancouver, the conference featured in-depth workshops and tutorials, informative discussions on the future of computer vision, and more than two thousand research papers, drawing over ten thousand attendees from the data science, computer vision, and machine learning fields. In this post, we’ll distill some of the overarching themes we took away from CVPR 2023.

Repurposing computer science solutions for AI creates new opportunities to scale

As Rodney Brooks noted in his keynote address, the hardware and compute power available today can bring old ideas to life and completely transform how we use older concepts and solutions. This theme carried into several studies that revisited concepts that have existed for decades in disciplines such as computer graphics. In one example, a process long used in visual effects to extract geometry from point clouds was repurposed for computer vision with positive results. Another group built Infinigen, a system that procedurally generates photorealistic natural worlds. Procedural models are a standard solution in filmmaking, and they have now been adapted for computer vision.

Bringing tried-and-true practices and solutions from computer science to the field of AI can result in more high-powered versions of these solutions. Over time, we’ll likely see more common software solutions become faster and generally more powerful as they’re transformed by AI.

Innovations in embeddings make data easier to understand & explore

The process of using embeddings to better explore and search data is becoming more common, and easier, for AI builders. Advances in how embeddings are created can help AI builders tackle more challenges in data curation and processing. One study introduced HierVL, a novel video-language embedding that simultaneously accounts for both long- and short-term associations. Previously, video-language embeddings could only capture associations between seconds-long clips and their accompanying text.

Another study presented visual DNA: a new method for comparing and evaluating datasets based on their features or attributes. By creating these “embeddings” for entire datasets, teams can analyze datasets against one another using various metrics, making it easier to identify which datasets are most similar to a target dataset and use them to improve the accuracy and performance of their AI models. Visual DNA can also pinpoint which neurons in a pre-trained feature extractor are most sensitive to differences between datasets, allowing AI builders to focus on those neurons when fine-tuning so the model better distinguishes between classes within the dataset. Overall, visual DNA gives AI builders a powerful tool for improving the effectiveness of their datasets and models.
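To make the general idea concrete (this is a simplified sketch of feature-based dataset comparison, not the paper's actual visual DNA method): summarize each dataset by statistics of features produced by a feature extractor, then compare those summaries with a distance metric. Smaller distances indicate more similar datasets.

```python
import numpy as np

def dataset_signature(features: np.ndarray):
    """Summarize a dataset's extracted features by per-dimension mean and std."""
    return features.mean(axis=0), features.std(axis=0)

def signature_distance(sig_a, sig_b) -> float:
    """Simple symmetric distance between two dataset signatures."""
    mu_a, sd_a = sig_a
    mu_b, sd_b = sig_b
    return float(np.abs(mu_a - mu_b).mean() + np.abs(sd_a - sd_b).mean())

rng = np.random.default_rng(0)
# Placeholder "features": in practice these would come from a pre-trained extractor.
dataset_a = rng.normal(0.0, 1.0, size=(500, 64))
dataset_b = rng.normal(0.1, 1.0, size=(500, 64))  # distribution similar to A
dataset_c = rng.normal(2.0, 3.0, size=(500, 64))  # distribution very different from A

sig_a = dataset_signature(dataset_a)
d_ab = signature_distance(sig_a, dataset_signature(dataset_b))
d_ac = signature_distance(sig_a, dataset_signature(dataset_c))
assert d_ab < d_ac  # the more similar dataset scores a smaller distance
```

Real implementations compare richer statistics of extractor activations, but the ranking principle is the same: datasets whose feature distributions are close are likely interchangeable for training.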

The creation of new, more robust types of embeddings has the potential to completely transform how AI teams understand and explore their data. With the right software and infrastructure, embeddings can enable teams to gain a comprehensive picture of what’s in their datasets, search and filter for specific values, and do it all in a matter of minutes — significantly accelerating the path to AI development without making any compromises on data quality.
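The search-and-filter workflow described above typically reduces to nearest-neighbor lookup in embedding space. A minimal sketch using cosine similarity over placeholder embeddings (in practice the vectors would come from a model such as an image or text encoder):

```python
import numpy as np

def cosine_search(index: np.ndarray, query: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Return indices of the top_k rows in `index` most similar to `query`."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = index_n @ query_n          # cosine similarity to every indexed item
    return np.argsort(scores)[::-1][:top_k]

# Placeholder embeddings standing in for model output.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(1000, 128))
query = embeddings[42] + 0.01 * rng.normal(size=128)  # slightly perturbed copy of item 42

top = cosine_search(embeddings, query)
assert top[0] == 42  # nearest neighbor recovers the original item
```

At production scale, the brute-force matrix product is usually replaced by an approximate nearest-neighbor index, but the interface (embed, then rank by similarity) stays the same.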

Foundation models mark a paradigm shift in how we build AI

Much of the transformative research at CVPR 2023 featured new ways to use foundation models, generative AI, and other off-the-shelf algorithms throughout the AI development process. OpenAI’s CLIP (Contrastive Language-Image Pre-training) model has emerged as a crucial foundation model in the field of computer vision. By combining the power of natural language processing and image understanding, CLIP has showcased a remarkable ability to bridge the semantic gap between textual descriptions and visual representations.

Researchers are using CLIP for everything from semantic segmentation to action recognition in video to zero-shot model diagnosis. Many demos and sessions were devoted to exploring how we can better use popular generative and foundation models such as Meta’s Segment Anything Model, Stable Diffusion, and more.
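CLIP's zero-shot mechanism is simple to state: embed the image and a set of candidate text prompts into the same space, then pick the prompt with the highest cosine similarity to the image. A minimal sketch of that scoring step, with random placeholder vectors standing in for the real CLIP encoders so the snippet runs without model weights:

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray, labels: list) -> str:
    """Pick the label whose text embedding is most similar to the image embedding."""
    image_n = image_emb / np.linalg.norm(image_emb)
    text_n = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = text_n @ image_n  # cosine similarity of each prompt to the image
    return labels[int(np.argmax(scores))]

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
rng = np.random.default_rng(2)
text_embs = rng.normal(size=(3, 512))                   # stand-ins for text-encoder output
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)   # an image "close to" the dog prompt

prediction = zero_shot_classify(image_emb, text_embs, labels)
assert prediction == "a photo of a dog"
```

With real CLIP, the placeholder vectors would be replaced by the model's image and text encoder outputs; the argmax-over-cosine-similarity step is unchanged, which is why CLIP adapts so readily to new label sets without retraining.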

With more data scientists and ML engineers using these cutting-edge AI solutions to accelerate and optimize how they build their own AI products, the emerging challenge is finding an easier way to connect these models to existing MLOps infrastructure for processes such as:

  • A/B testing and comparing foundation models on specific metrics
  • Pulling foundation model output into labeling and review workflows as pre-labeled data
  • Moving unstructured data through foundation models for enrichment

That’s why Labelbox is launching Model Foundry, a new solution that enables teams to easily choose and connect with a foundation model to enhance and improve their AI development workflows. Sign up today to access the waitlist.