Text generation

LLaVA v1.5 is a Large Language and Vision Assistant.

LLaVA is an end-to-end trained large multimodal model that connects a vision encoder to the Vicuna language model for general-purpose visual and language understanding. It shows strong chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.

Intended Use

Multimodal Chat Capabilities: LLaVA is designed for chat over images and text, with general-purpose visual and language understanding. It is fine-tuned on multimodal instruction-following data generated with GPT-4.

Visual Instruction Following: LLaVA is an early effort in visual instruction tuning, a technique that fine-tunes large language models to understand and follow instructions grounded in visual input. This is useful when a model needs to describe an image, perform an action in a virtual environment, or answer questions about a scene in a photograph.
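In practice, visual instruction following means pairing an image with a text instruction in a single prompt. A minimal sketch in Python, assuming the single-turn `USER: <image>` ... `ASSISTANT:` template used by the community Hugging Face conversions of LLaVA 1.5. The helper name `build_llava_prompt` and the template itself are assumptions for illustration and may differ in the deployment you are using:

```python
# Placeholder that the image processor is assumed to replace with image features.
IMAGE_TOKEN = "<image>"


def build_llava_prompt(instruction: str, system: str = "") -> str:
    """Build a single-turn, single-image prompt (hypothetical helper, assumed template)."""
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must not be empty")
    # An optional system message goes in front of the user turn.
    prefix = f"{system.strip()} " if system.strip() else ""
    return f"{prefix}USER: {IMAGE_TOKEN}\n{instruction} ASSISTANT:"


if __name__ == "__main__":
    print(build_llava_prompt("What is shown in this photograph?"))
```

The returned string would be passed to the model together with the image itself. The model's answer is whatever it generates after the trailing `ASSISTANT:` marker.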


Early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of the multimodal GPT-4 on unseen images and instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the combination of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
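A relative score of this kind can be read as a ratio: a judge rates the candidate's answer and the reference (GPT-4) answer to each question, and the candidate's average rating is expressed as a percentage of the reference's average. A minimal sketch of that arithmetic, assuming a ratio of mean ratings. The function name and the sample ratings are made up for illustration and are not the evaluation data behind the reported figure:

```python
from statistics import mean


def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's mean rating as a percentage of the reference's mean rating."""
    if len(candidate_ratings) != len(reference_ratings):
        raise ValueError("need one rating pair per question")
    if not candidate_ratings:
        raise ValueError("need at least one rating pair")
    reference_mean = mean(reference_ratings)
    if reference_mean == 0:
        raise ValueError("reference mean rating is zero")
    return 100.0 * mean(candidate_ratings) / reference_mean


if __name__ == "__main__":
    # Hypothetical judge ratings on a 1-10 scale, one pair per question.
    candidate = [7, 8, 6, 9]
    reference = [9, 9, 8, 10]
    print(f"{relative_score(candidate, reference):.1f}%")
```

A ratio of means weights every question equally. Averaging per-question ratios instead would give more weight to questions where the reference answer was rated low.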


Hallucination: Like GPT-3.5, LLaVA can hallucinate, generating inaccurate or false information.

Mathematical Problem-Solving: LLaVA is limited in mathematical problem-solving, an area where other models or systems may perform better.

Translation Tasks: LLaVA 1.5 struggles with translation between languages.




