Multimodal · April 17, 2023 · University of Wisconsin / Microsoft Research

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Abstract

We present LLaVA (Large Language and Vision Assistant), the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. Instruction-tuned on this generated data, LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behavior of multimodal GPT-4 on unseen images and instructions.

Key Findings

  • Used GPT-4 to generate multimodal instruction-following training data
  • Connected a CLIP visual encoder with the LLaMA language model for vision-language tasks
  • Achieved impressive visual chat abilities with a relatively simple architecture
  • Demonstrated that visual instruction tuning can be done efficiently
  • Released the model, data, and code as open source
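The CLIP-to-LLaMA connection described above can be sketched in a few lines: visual features from the frozen CLIP encoder are mapped through a learned linear projection into the language model's embedding space and prepended to the text tokens. This is a minimal illustration with assumed dimensions (the real model uses CLIP ViT-L/14 features and a trained projection), not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 576 image patches, CLIP feature dim 1024, LLM embed dim 4096.
num_patches, clip_dim, llm_dim = 576, 1024, 4096

# Stand-in for the frozen CLIP visual encoder's patch features.
visual_features = rng.standard_normal((num_patches, clip_dim))

# The learned projection matrix W that aligns visual features with the LLM.
W = rng.standard_normal((clip_dim, llm_dim)) * 0.02

# Project visual features into the LLM's token-embedding space...
visual_tokens = visual_features @ W

# ...and prepend them to the embedded text prompt as one input sequence.
text_tokens = rng.standard_normal((32, llm_dim))  # e.g. a 32-token prompt
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (608, 4096): image tokens followed by text tokens
```

The simplicity of this bridge, a single projection layer rather than a heavy cross-attention module, is part of why the architecture counts as "relatively simple."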

Impact & Significance

LLaVA popularized the visual instruction tuning approach and inspired dozens of open-source multimodal models. It made multimodal AI accessible to the open-source community and influenced how vision-language models are built.
