WildVision-Bench
WildVision-Bench evaluates multimodal AI models on challenging real-world vision-language tasks collected from actual user interactions in the wild. Unlike curated academic datasets, WildVision captures the diversity and difficulty of how people naturally interact with vision-language models, including complex scene understanding, multi-image reasoning, and nuanced visual questions that require world knowledge and common sense.
Metrics
Win rate (%) against baseline in human evaluation
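The score is a pairwise comparison statistic: each candidate response is judged against the baseline's response on the same prompt. As a minimal sketch of how such a win rate can be computed (the judgment labels and half-credit handling of ties below are assumptions for illustration, not the benchmark's documented scoring):

```python
# Hypothetical win-rate calculation for pairwise judgments.
# Labels ("win", "tie", "loss") and half-credit for ties are assumptions;
# WildVision-Bench's exact tie handling may differ.
from collections import Counter

def win_rate(judgments: list[str]) -> float:
    """Win rate of a candidate model against the baseline, in percent,
    counting a tie as half a win."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / total

# Example: 70 wins, 10 ties, 20 losses out of 100 comparisons -> 75.0
print(win_rate(["win"] * 70 + ["tie"] * 10 + ["loss"] * 20))
```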
Created By
Yujie Lu et al. (WildVision Team)
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Win Rate | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 78.4% | 2026-03 |
| 2 | Gemini 3 Ultra | 76.9% | 2026-01 |
| 3 | Claude Opus 4.6 | 75.2% | 2026-02 |
| 4 | Grok 4 | 71.6% | 2026-02 |
| 5 | Llama 4 405B | 67.3% | 2026-01 |
Related Vision Benchmarks
VQAv2
Visual Question Answering v2 is a large-scale benchmark containing over 1 million questions about images from COCO. It tests the ability to answer open-ended questions that require understanding image content.
Top: Gemini 3 Ultra — 88.9%
DocVQA
Document Visual Question Answering evaluates the ability of models to understand and answer questions about document images including forms, invoices, scientific papers, and handwritten notes.
Top: Gemini 3 Ultra — 95.2%
ChartQA
ChartQA tests the ability of models to answer questions about charts and visualizations, requiring both visual understanding of chart elements and reasoning about the underlying data.
Top: GPT-5.2 — 90.1%
RealWorldQA
RealWorldQA evaluates vision-language models on practical, real-world visual understanding tasks including spatial reasoning about real photographs, reading text in images, understanding scenes, and answering practical questions.
Top: Gemini 3 Ultra — 79.6%