BLIP2 (Flan-T5 XL COCO)
Text generation
BLIP2 is a visual language model (VLM) that can perform multi-modal tasks such as image captioning and visual question answering.
Intended Use
Use BLIP2 for multi-modal tasks that take an image as input and produce text as output, such as image captioning and visual question answering.
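The two tasks named above can be sketched with the Hugging Face `transformers` library. This is a minimal illustration and not an official integration: the checkpoint name `Salesforce/blip2-flan-t5-xl-coco`, the helper names `build_vqa_prompt` and `run_blip2`, the `Question: ... Answer:` prompt template, and the `max_new_tokens` value are all assumptions made for the example.

```python
from typing import Optional


def build_vqa_prompt(question: str) -> str:
    """Wrap a question in a 'Question: ... Answer:' template (assumed prompt format)."""
    return f"Question: {question.strip()} Answer:"


def run_blip2(
    image_path: str,
    question: Optional[str] = None,
    checkpoint: str = "Salesforce/blip2-flan-t5-xl-coco",  # assumed checkpoint name
) -> str:
    """Caption the image if no question is given, otherwise answer the question."""
    # Heavy dependencies are imported here so the prompt helper works without them.
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    if question is None:
        # Image captioning: the model generates text from the image alone.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Visual question answering: the image is paired with a text prompt.
        inputs = processor(
            images=image, text=build_vqa_prompt(question), return_tensors="pt"
        )

    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

Calling `run_blip2("photo.jpg")` would request a caption, and `run_blip2("photo.jpg", "What is in the image?")` would request an answer. Both calls download the model weights on first use, which are several gigabytes in size.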
Performance
BLIP-2 with a ViT-g image encoder and the OPT-2.7B language model scores 52.3 on the VQAv2 dataset.
Citation
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.