Multimodal | September 25, 2023 | OpenAI
GPT-4V(ision) System Card
OpenAI
Abstract
This system card describes GPT-4 with vision (GPT-4V), which enables users to instruct GPT-4 to analyze image inputs. We describe the safety evaluations, mitigations, and deployment preparation for the multimodal capabilities of GPT-4V including visual question answering, image description, spatial reasoning, and document understanding.
Key Findings
- Extended GPT-4 with robust image understanding capabilities
- Demonstrated strong visual question answering and spatial reasoning
- Handled diverse image types: photos, diagrams, charts, screenshots, documents
- Included extensive safety evaluations for multimodal risks
- Enabled new applications in accessibility, education, and analysis
Impact & Significance
GPT-4V helped bring multimodal AI into mainstream use by letting users analyze images, documents, and other visual content through conversation. It opened up new categories of AI applications in education, healthcare, and accessibility.