The LLaVA Visual Instruct 150K dataset is a collection of multimodal instruction-following samples. It was generated by prompting GPT-4 to produce responses grounded in images from the COCO dataset, and it contains three types of instruction data: conversations, detailed descriptions, and complex reasoning. Its primary use is training and evaluating multimodal language models, such as LLaVA, that can respond to both text and images.
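
The samples are distributed as JSON. Below is a minimal sketch of loading and inspecting one record; it assumes a combined file named `llava_instruct_150k.json` and records carrying `id`, `image`, and `conversations` fields with `from`/`value` turns, which reflect a common release layout rather than anything guaranteed by the description above.

```python
import json

# Minimal inspection sketch. The file name and field names (`image`,
# `conversations`, `from`, `value`) are assumptions about the release layout.
with open("llava_instruct_150k.json", "r", encoding="utf-8") as f:
    samples = json.load(f)  # a list of dicts, one per instruction sample

print(f"total samples: {len(samples)}")

example = samples[0]
print("COCO image file:", example["image"])         # image the prompt refers to
for turn in example["conversations"]:               # alternating human / gpt turns
    print(f'{turn["from"]}: {turn["value"][:80]}')  # truncate long responses
```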