DocVQA
Document Visual Question Answering (DocVQA) evaluates a model's ability to understand and answer questions about document images, including forms, invoices, scientific papers, and handwritten notes.
Metrics
ANLS (Average Normalized Levenshtein Similarity) on document questions
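Since the leaderboard reports ANLS, the sketch below shows how the metric is commonly computed, assuming the 0.5 threshold used in the original DocVQA evaluation. The function names and the lowercasing normalization are illustrative, not the official evaluation script.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def anls(predictions: List[str], gold_answers: List[List[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity.

    For each question, score the prediction against every accepted gold
    answer, keep the best match, and zero out matches whose normalized
    edit distance is at or above the threshold tau (0.5 in the DocVQA
    paper). The final score is the mean over all questions.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, gold_answers):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            denom = max(len(p), len(a))
            nl = levenshtein(p, a) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)

# Example: one near-miss OCR answer and one exact match.
print(anls(["lnvoice #1042", "March 3"], [["Invoice #1042"], ["March 3", "03/03"]]))
```

The threshold matters: a prediction that is a plausible OCR slip (one wrong character) still earns most of its credit, while an answer more than half-wrong by edit distance scores zero rather than accumulating partial credit.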
Created By
Computer Vision Center (CVC), Barcelona
Top Model Scores
| Rank | Model | ANLS Score | Date |
|---|---|---|---|
| 1 | Gemini 3 Ultra | 95.2% | 2026-01 |
| 2 | GPT-5.2 | 94.7% | 2026-03 |
| 3 | Claude Opus 4.6 | 93.8% | 2026-02 |
| 4 | InternVL 3 | 91.4% | 2026-01 |
| 5 | Qwen2-VL 72B | 89.6% | 2025-12 |
Related Vision Benchmarks
VQAv2
Visual Question Answering v2 (VQAv2) is a large-scale benchmark containing over 1 million open-ended questions about images from COCO. It tests the ability to answer questions that require understanding image content.
Top: Gemini 3 Ultra — 88.9%
ChartQA
ChartQA tests the ability of models to answer questions about charts and visualizations, requiring both visual understanding of chart elements and reasoning about the underlying data.
Top: GPT-5.2 — 90.1%
RealWorldQA
RealWorldQA evaluates vision-language models on practical, real-world visual understanding tasks, including spatial reasoning about photographs, reading text in images, and understanding everyday scenes.
Top: Gemini 3 Ultra — 79.6%
WildVision-Bench
WildVision-Bench evaluates multimodal AI models on challenging real-world vision-language tasks collected from actual user interactions in the wild. Unlike curated academic datasets, WildVision captures the diversity and difficulty of how people naturally interact with vision-language models, including complex scene understanding, multi-image reasoning, and nuanced visual questions that require world knowledge and common sense.
Top: GPT-5.2 — 78.4%