This year’s CVPR conference featured over two thousand papers. We parsed these papers to find some of the emerging themes in the computer vision field, and discovered three standout trends.

Transformers gain traction over CNNs

When transformers first entered the AI space in 2017, they were used primarily for language translation, but they were soon adapted for a wide range of NLP tasks. In 2020, the paper An Image Is Worth 16x16 Words introduced vision transformers, and in 2021 a vision transformer was shown to outperform CNNs at image classification. CVPR 2022 introduced more work on vision transformers.

Scaling Vision Transformers is a study from the same Google Brain group that introduced vision transformers in 2020. This work trains a transformer with two billion parameters, achieving 90.45% top-1 accuracy on ImageNet and performing exceptionally well on few-shot transfer learning. Other papers, such as Deformable Video Transformer and Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, extend transformers to new computer vision modalities such as video and 3D point clouds.
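
The core idea behind vision transformers is to treat an image as a sequence of tokens: non-overlapping patches that are flattened and linearly projected into embedding vectors. A minimal NumPy sketch of that tokenization step (the function name is ours, and a random matrix stands in for the learned projection, so this is illustrative only):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, embed_dim=64, rng=None):
    """Split an image (H, W, C) into non-overlapping patches and linearly
    project each flattened patch to an embedding vector, as in the
    "image as 16x16 words" formulation. Illustrative sketch: a real
    vision transformer learns the projection during training."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # (H/P, P, W/P, P, C) -> (num_patches, P*P*C)
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, patch_size * patch_size * c)
    # Random projection standing in for the learned embedding matrix.
    projection = rng.normal(size=(patches.shape[1], embed_dim))
    return patches @ projection  # (num_patches, embed_dim)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 64): a 224x224 image yields 14x14 = 196 tokens
```

These tokens, plus positional embeddings and a class token, are what a standard transformer encoder then consumes.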

Neural radiance fields (NeRFs) take over 3D rendering

Neural Radiance Fields, or NeRFs, were introduced in 2020. At CVPR 2022, there were over 50 papers on radiance fields, and NeRFs have become a key technique for anyone interested in volume rendering, view synthesis, and the state of the art in 3D rendering.

NeRF in the Wild shows how radiance fields can be used to create 3D representations using only unstructured datasets of in-the-wild photographs. The paper shows how this technique can even be used to capture varied lighting and transient occlusions. D-NeRF: Neural Radiance Fields for Dynamic Scenes introduces a variant of the NeRF technique that enables an algorithm to create 3D representations of scenes from photographs that show movement, rather than just static objects.
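
Underlying all of these variants is the same discrete volume-rendering quadrature from the original NeRF paper: densities and colors sampled along a camera ray are composited into a single pixel color. A minimal NumPy sketch of that compositing step (the function name and array shapes are our own, not an API from any of the papers above):

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and colors along one ray using the
    discrete volume-rendering quadrature from the original NeRF paper.
    sigmas: (N,) densities; colors: (N, 3) RGB; deltas: (N,) distances
    between adjacent samples along the ray."""
    alphas = 1.0 - np.exp(-sigmas * deltas)  # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)  # expected RGB
```

Because this estimator is differentiable in the densities and colors, the network producing them can be trained directly from photographs with a simple pixel-wise loss.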

Transfer learning becomes a canonical technique

Several papers at CVPR 2022 show that transfer learning, the practice of taking a model trained on a broader or related task and fine-tuning it for one's requirements, is a successful technique for computer vision. A paper evaluating both CNNs and transformers trained via transfer learning shows that transformers carried high performance from ImageNet classification into downstream tasks better than comparable CNNs. Robust Fine-Tuning of Zero-Shot Models introduces a simple and effective method for improving robustness during fine-tuning: ensembling the weights of the zero-shot and fine-tuned models.
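
The weight-space ensembling idea can be sketched in a few lines: given two sets of weights with matching parameter names, interpolate them entry-wise. The helper name and default mixing coefficient below are illustrative, not the paper's API:

```python
def interpolate_weights(zero_shot_weights, fine_tuned_weights, alpha=0.5):
    """Weight-space ensembling sketch: linearly interpolate between the
    zero-shot and fine-tuned weights of the same architecture.
    alpha=0 keeps the zero-shot model; alpha=1 keeps the fine-tuned one.
    Both inputs map parameter names to arrays/tensors of matching shape."""
    return {name: (1 - alpha) * zero_shot_weights[name]
                  + alpha * fine_tuned_weights[name]
            for name in zero_shot_weights}
```

In a framework like PyTorch, the dictionaries here would be the two models' state dicts; the interpolated dictionary is loaded back into the architecture to produce the ensembled model.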

Other papers worth noting from CVPR 2022 include:

  • Pointly-Supervised Instance Segmentation: This paper shows how to reach 94-98% of the fully supervised performance of Mask R-CNN by annotating only a bounding box and ten random points with in/out labels instead of drawing the whole mask. This approach could yield significant savings in labeling time and cost for segmentation problems.
  • Estimating Example Difficulty using Variance of Gradients: This study compares gradients across checkpoints of a training run to find “difficult” examples in a training set, i.e., examples whose gradients remain noisy as the network converges. By looking at how different images behave across successive snapshots, ML engineers can figure out which ones are trivial representatives of a class and which ones are closer to corner cases.
  • Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction: This paper introduces H-NNE, a new dimensionality reduction method. The study shows that it takes less than six minutes to project over a million embeddings. The resulting projections could be an alternative to methods like t-SNE or UMAP, running over an order of magnitude faster.
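
The variance-of-gradients idea above can be made concrete with a short sketch (the function name and array shapes are our own assumptions, not the paper's code): input-gradient magnitudes are collected at several training checkpoints, and their variance across checkpoints, averaged over pixels, serves as a per-example difficulty score.

```python
import numpy as np

def vog_scores(pixel_grads):
    """Variance-of-gradients sketch. pixel_grads: (K, N, H, W) array of
    gradients of the loss w.r.t. each input pixel, captured at K training
    checkpoints for N examples. Returns one difficulty score per example:
    high variance across checkpoints suggests noisier convergence."""
    per_pixel_var = pixel_grads.var(axis=0)  # (N, H, W)
    return per_pixel_var.mean(axis=(1, 2))   # (N,)
```

Examples whose gradients barely change between checkpoints score near zero (easy), while examples with unstable gradients score high and are worth reviewing as potential corner cases or label noise.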

We enjoyed meeting with customers, partners, and AI practitioners at CVPR 2022, and look forward to supporting their endeavors to build better, faster computer vision AI.