MultimodalSeptember 25, 2023OpenAI

GPT-4V(ision) System Card

OpenAI

Abstract

This system card describes GPT-4 with vision (GPT-4V), which enables users to instruct GPT-4 to analyze image inputs. We describe the safety evaluations, mitigations, and deployment preparation for the multimodal capabilities of GPT-4V including visual question answering, image description, spatial reasoning, and document understanding.

Key Findings

1Extended GPT-4 with robust image understanding capabilities
2Demonstrated strong visual question answering and spatial reasoning
3Handled diverse image types: photos, diagrams, charts, screenshots, documents
4Included extensive safety evaluations for multimodal risks
5Enabled new applications in accessibility, education, and analysis

Impact & Significance

GPT-4V made multimodal AI practical and mainstream, enabling users to analyze images, documents, and visual content through conversation. It opened up new categories of AI applications in education, healthcare, and accessibility.

Related Tools

Chatgpt Openai Api

Read Full Paper

GPT-4V(ision) System Card

Abstract

Key Findings

Impact & Significance

Related Tools

Related Papers

The Llama 3 Herd of Models

Qwen2 Technical Report

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

The Claude 3 Model Family: Opus, Sonnet, and Haiku