Multimodal · April 17, 2023 · University of Wisconsin / Microsoft Research
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
Abstract
We present LLaVA (Large Language and Vision Assistant), the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. Instruction-tuned on this generated data, LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting behaviors similar to multimodal GPT-4 on unseen images and instructions.
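The data pipeline behind this claim is worth making concrete: language-only GPT-4 never sees pixels. Each image is instead rendered symbolically as its captions and object bounding boxes, and GPT-4 is prompted to write an instruction-following conversation grounded in that description. Below is a minimal sketch of that prompting step; `query_gpt4` is a hypothetical helper standing in for a real LLM API call, and the annotations are illustrative COCO-style examples, not the paper's actual data.

```python
# Sketch of LLaVA-style language-only data generation: the LLM receives a
# symbolic rendering of the image (captions + bounding boxes), not pixels.
# `query_gpt4` is a hypothetical stand-in for an actual LLM API call.

def build_prompt(captions: list[str], boxes: list[str]) -> str:
    """Render the image symbolically and ask for a grounded conversation."""
    context = "\n".join(captions + boxes)
    return (
        "You are an AI visual assistant looking at an image described by:\n"
        f"{context}\n\n"
        "Write a multi-turn conversation between a person asking questions "
        "about the image and the assistant answering as if it saw the image."
    )

# Illustrative COCO-style annotations for one image (made up for this sketch).
captions = ["A group of people standing outside of a black vehicle."]
boxes = [
    "person: [0.68, 0.24, 0.77, 0.69]",
    "truck: [0.35, 0.28, 0.90, 0.88]",
]

prompt = build_prompt(captions, boxes)
# conversation = query_gpt4(prompt)  # hypothetical call returning Q/A turns
```

The paper uses this recipe to produce three response types (multi-turn conversation, detailed description, and complex reasoning), yielding the 158K-sample language-image instruction-following dataset used for tuning.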
Key Findings
- Used language-only GPT-4 to generate multimodal instruction-following training data
- Connected a CLIP visual encoder to a LLaMA-family language model (Vicuna) through a simple linear projection (see the sketch after this list)
- Achieved impressive visual chat abilities with a relatively simple architecture
- Demonstrated that visual instruction tuning can be done efficiently via a two-stage recipe: pretrain the projection for feature alignment, then fine-tune the projection and LLM on instruction data
- Released the model, data, and code as open source
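The connector in the second finding is strikingly small: a single trainable linear layer W maps CLIP patch features Z_v into the LLM's word-embedding space, H_v = W · Z_v, so visual tokens can be consumed like ordinary text tokens. Below is a minimal PyTorch sketch, assuming CLIP ViT-L/14's 1024-dimensional patch features and a 4096-dimensional LLM hidden size (as in a 7B LLaMA); the class name and dimensions are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Sketch of LLaVA v1's connector: one linear layer W projecting
    CLIP patch features Z_v into the LLM token-embedding space,
    H_v = W @ Z_v."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only new weights

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. the
        # 256 patch tokens from CLIP ViT-L/14 at 224x224 resolution.
        return self.proj(patch_features)

# Toy usage: project stand-in CLIP features and prepend them to text
# embeddings so the LLM consumes visual tokens like word tokens.
connector = VisionLanguageConnector()
z_v = torch.randn(1, 256, 1024)           # stand-in for CLIP patch features
h_v = connector(z_v)                      # (1, 256, 4096) visual tokens
h_text = torch.randn(1, 32, 4096)         # stand-in for word embeddings
llm_inputs = torch.cat([h_v, h_text], 1)  # sequence fed to the frozen-or-tuned LLM
```

This one-matrix projection is what keeps the original recipe so cheap to train; later LLaVA versions replace it with a small MLP, but the overall design is unchanged.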
Impact & Significance
LLaVA popularized visual instruction tuning and inspired dozens of open-source multimodal models. By releasing the full model, data, and code, it brought multimodal AI within reach of the open-source community and shaped how subsequent vision-language models are built.
Related Papers
LLM · July 23, 2024
The Llama 3 Herd of Models
Meta AI

LLM · July 15, 2024
Qwen2 Technical Report
Alibaba Cloud / Qwen Team

Efficiency · May 7, 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI

LLM · March 4, 2024
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic