Grounding DINO + SAM

Custom ontology
Image segmentation
Video segmentation

Grounding DINO + SAM, also known as Grounded SAM, combines Grounding DINO, an open-set object detector, with the Segment Anything Model (SAM). The integration enables detection and segmentation of arbitrary regions from free-text prompts, letting users create segmentation masks quickly and opening the door to connecting various vision models.
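The pipeline described above can be sketched in a few lines. This is a hedged, illustrative sketch only: `detect` and `segment` are stand-ins for the real Grounding DINO and SAM calls (which require model weights), and all names, scores, and the threshold are assumptions, not the actual API.

```python
# Illustrative sketch of the Grounded SAM pipeline. `detect` and `segment`
# are hypothetical stubs standing in for Grounding DINO and SAM so the
# control flow is runnable without model weights.
from dataclasses import dataclass
from typing import Any, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Detection:
    label: str    # text phrase matched by the open-set detector
    score: float  # detection confidence
    box: Box


def detect(image: Any, prompt: str) -> List[Detection]:
    """Stand-in for Grounding DINO: boxes for phrases found in `prompt`."""
    # In practice this runs the open-set detector on `image`; the values
    # below are fabricated placeholders.
    return [Detection("cat", 0.92, (10, 20, 120, 200)),
            Detection("dog", 0.45, (150, 30, 300, 220))]


def segment(image: Any, box: Box) -> Box:
    """Stand-in for SAM's box-prompted mask (a real call returns pixels)."""
    return box


def grounded_sam(image: Any, prompt: str, threshold: float = 0.3):
    """Detect text phrases, then prompt SAM with each box for a labeled mask."""
    results = []
    for det in detect(image, prompt):
        if det.score < threshold:
            continue  # drop low-confidence detections before segmenting
        mask = segment(image, det.box)
        results.append((det.label, det.score, mask))
    return results
```

A call such as `grounded_sam(image, "cat. dog.")` would return one labeled mask per detection that clears the confidence threshold.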

Intended Use

  • Create segmentation masks using SAM and classify the masks using Grounding DINO. The masks are intended to be used as pre-labels.
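The classification step above can be sketched as matching each SAM mask to the Grounding DINO detection it overlaps most. This is a minimal, assumed implementation: masks are represented by their bounding boxes, and the `classify_masks` name, `"unknown"` fallback, and IoU threshold are illustrative choices, not part of either library.

```python
# Hypothetical sketch: label each SAM mask (represented here by its
# bounding box) with the best-overlapping Grounding DINO detection.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def classify_masks(masks: List[Box], detections: Dict[str, Box],
                   min_iou: float = 0.5) -> List[str]:
    """Assign each mask the label of its best-overlapping detection,
    or 'unknown' when no detection clears the IoU threshold."""
    labels = []
    for mask in masks:
        best_label, best_score = "unknown", min_iou
        for label, box in detections.items():
            score = iou(mask, box)
            if score >= best_score:
                best_label, best_score = label, score
        labels.append(best_label)
    return labels
```

Masks labeled `"unknown"` would simply be left unclassified in the exported pre-labels, so a reviewer can assign them manually.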


Limitations

  • Inaccurate classification can occur, especially in aerial imagery for classes such as roofs and solar panels.

  • Mask accuracy is suboptimal for complex shapes, low-contrast regions, and small objects.


Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. (2023). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499

Chen, J., Yang, Z., & Zhang, L. (2023). Semantic Segment Anything. https://github.com/fudan-zvg/Semantic-Segment-Anything