Multimodal · April 17, 2023 · University of Wisconsin / Microsoft Research

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Abstract

We present LLaVA (Large Language and Vision Assistant), the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. Instruction-tuned on this generated data, LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behavior of multimodal GPT-4 on unseen images and instructions.

Key Findings

  • Used GPT-4 to generate multimodal instruction-following training data
  • Connected a CLIP visual encoder with the LLaMA language model for vision-language tasks
  • Achieved impressive visual chat abilities with a relatively simple architecture
  • Demonstrated that visual instruction tuning can be done efficiently
  • Released the model, data, and code as open source
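The CLIP-to-LLaMA connection described above can be sketched in a few lines: visual features from the frozen CLIP encoder are mapped through a learned linear projection into the language model's embedding space and prepended to the text tokens. This is a minimal illustration with assumed dimensions (the real model uses CLIP ViT-L/14 features and a trained projection), not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 576 image patches, CLIP feature dim 1024, LLM embed dim 4096.
num_patches, clip_dim, llm_dim = 576, 1024, 4096

# Stand-in for the frozen CLIP visual encoder's patch features.
visual_features = rng.standard_normal((num_patches, clip_dim))

# The learned projection matrix W that aligns visual features with the LLM.
W = rng.standard_normal((clip_dim, llm_dim)) * 0.02

# Project visual features into the LLM's token-embedding space...
visual_tokens = visual_features @ W

# ...and prepend them to the embedded text prompt as one input sequence.
text_tokens = rng.standard_normal((32, llm_dim))  # e.g. a 32-token prompt
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (608, 4096): image tokens followed by text tokens
```

The simplicity of this bridge, a single projection layer rather than a heavy cross-attention module, is part of why the architecture counts as "relatively simple."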

Impact & Significance

LLaVA popularized the visual instruction tuning approach and inspired dozens of open-source multimodal models. It made multimodal AI accessible to the open-source community and influenced how vision-language models are built.
