WildVision-Bench
WildVision-Bench evaluates multimodal AI models on challenging real-world vision-language tasks collected from actual user interactions in the wild. Unlike curated academic datasets, WildVision captures the diversity and difficulty of how people naturally interact with vision-language models, including complex scene understanding, multi-image reasoning, and nuanced visual questions that require world knowledge and common sense.
Metrics
Win rate (%) against baseline in human evaluation
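The score is a pairwise comparison statistic: each candidate response is judged against the baseline's response on the same prompt. As a minimal sketch of how such a win rate can be computed (the judgment labels and half-credit handling of ties below are assumptions for illustration, not the benchmark's documented scoring):

```python
# Hypothetical win-rate calculation for pairwise judgments.
# Labels ("win", "tie", "loss") and half-credit for ties are assumptions;
# WildVision-Bench's exact tie handling may differ.
from collections import Counter

def win_rate(judgments: list[str]) -> float:
    """Win rate of a candidate model against the baseline, in percent,
    counting a tie as half a win."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / total

# Example: 70 wins, 10 ties, 20 losses out of 100 comparisons -> 75.0
print(win_rate(["win"] * 70 + ["tie"] * 10 + ["loss"] * 20))
```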
Created By
Yujie Lu et al. (WildVision Team)
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Win Rate | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 78.4% | 2026-03 |
| 2 | Gemini 3 Ultra | 76.9% | 2026-01 |
| 3 | Claude Opus 4.6 | 75.2% | 2026-02 |
| 4 | Grok 4 | 71.6% | 2026-02 |
| 5 | Llama 4 405B | 67.3% | 2026-01 |
Related Vision Benchmarks
VQAv2
Visual Question Answering v2 is a large-scale benchmark containing over 1 million questions about images from COCO. It tests the ability to answer open-ended questions that require understanding image content.
Top: Gemini 3 Ultra — 88.9%
DocVQA
Document Visual Question Answering evaluates the ability of models to understand and answer questions about document images including forms, invoices, scientific papers, and handwritten notes.
Top: Gemini 3 Ultra — 95.2%
ChartQA
ChartQA tests the ability of models to answer questions about charts and visualizations, requiring both visual understanding of chart elements and reasoning about the underlying data.
Top: GPT-5.2 — 90.1%
RealWorldQA
RealWorldQA evaluates vision-language models on practical, real-world visual understanding tasks including spatial reasoning about real photographs, reading text in images, understanding scenes, and answering practical questions.
Top: Gemini 3 Ultra — 79.6%