LLaVA Instruct 150K

Contributors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Data rows: 157,712
Conversational text
Foundation models

The LLaVA Visual Instruct 150K dataset contains multimodal samples designed for visual instruction-following tasks. It was generated by using the GPT-4 model to produce responses to image-based prompts sourced from the COCO dataset, and it is divided into three categories: conversations, detailed descriptions, and complex reasoning. Its primary use is to train and evaluate multimodal language models, such as LLaVA, that can respond to both text and images.
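As a rough illustration of the record layout, the released JSON stores a list of records, each with an `id`, an `image` filename, and a `conversations` list of alternating human/gpt turns. The sketch below parses one such record and pairs up prompts with responses; the sample record itself is illustrative, not taken from the dataset.

```python
import json

# Illustrative record in the LLaVA Instruct 150K layout
# (field names follow the released JSON; values are made up).
sample = json.loads("""
[
  {
    "id": "000000123456",
    "image": "000000123456.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\\nWhat is in the picture?"},
      {"from": "gpt", "value": "A dog playing in a park."}
    ]
  }
]
""")

def extract_turns(record):
    """Return (prompt, response) pairs from one record's conversations."""
    convs = record["conversations"]
    pairs = []
    # Turns alternate: even indices are human prompts, odd are gpt responses.
    for human, gpt in zip(convs[::2], convs[1::2]):
        assert human["from"] == "human" and gpt["from"] == "gpt"
        pairs.append((human["value"], gpt["value"]))
    return pairs

pairs = extract_turns(sample[0])
print(len(pairs))      # number of (prompt, response) turns in the record
print(pairs[0][1])     # the GPT-4-generated response text
```

The `<image>` placeholder in the first human turn marks where the image is injected during training; single-turn records come from the detailed-description and complex-reasoning categories, while multi-turn records come from the conversation category.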

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)