
LLaVA Instruct 150K

Contributors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Datarows: 157,712
Conversational text
Foundation models

The LLaVA Visual Instruct 150K dataset contains multimodal samples designed for instruction-following tasks. It was built by prompting the language-only GPT-4 model with captions and bounding boxes of images sourced from the COCO dataset, so that GPT-4 could generate image-grounded instruction data. The dataset covers three categories: conversations, detailed descriptions, and complex reasoning. Its primary use is to train and evaluate multimodal language models, such as LLaVA, that can respond to both text and images.
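The sketch below shows one way to load and inspect the dataset locally. It is a minimal example, not part of this card: it assumes the `llava_instruct_150k.json` file has already been downloaded, and that records follow the commonly published schema (`id`, `image`, and a `conversations` list with `from`/`value` keys).

```python
import json
from collections import Counter

# Assumption: llava_instruct_150k.json has been downloaded to the working directory.
DATA_PATH = "llava_instruct_150k.json"

with open(DATA_PATH, "r", encoding="utf-8") as f:
    records = json.load(f)

print(f"Total datarows: {len(records)}")

# Each record pairs a COCO image with a multi-turn conversation
# (assumed field names: "image", "conversations", "from", "value").
sample = records[0]
print("Image file:", sample.get("image"))
for turn in sample.get("conversations", []):
    print(f"[{turn['from']}] {turn['value'][:80]}")

# Quick sanity check: count conversation turns per speaker across the dataset.
speaker_counts = Counter(
    turn["from"] for rec in records for turn in rec.get("conversations", [])
)
print(speaker_counts)
```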

Citation
Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). Improved Baselines with Visual Instruction Tuning. https://arxiv.org/abs/2310.03744
License
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)