Vision · Est. 2024

WildVision-Bench

WildVision-Bench evaluates multimodal AI models on challenging real-world vision-language tasks collected from actual user interactions in the wild. Unlike curated academic datasets, WildVision captures the diversity and difficulty of how people naturally interact with vision-language models, including complex scene understanding, multi-image reasoning, and nuanced visual questions that require world knowledge and common sense.

Metrics

Win rate (%) against baseline in human evaluation
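To make the metric concrete, here is a minimal sketch of how a pairwise win rate against a baseline can be computed from human preference judgments. The tie-handling rule (half credit) and all names here are illustrative assumptions, not WildVision's documented scoring procedure.

```python
# Hypothetical win-rate computation: each human judgment compares the
# candidate model's answer against the baseline's and is labeled
# "win", "tie", or "loss" from the candidate's perspective.
# Counting ties as half a win is an assumption for illustration.

def win_rate(outcomes):
    """Return the win rate in percent, with ties counted as half a win."""
    if not outcomes:
        raise ValueError("no judgments provided")
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return 100.0 * score / len(outcomes)

# Ten hypothetical judgments: 6 wins, 2 ties, 2 losses.
judgments = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
print(win_rate(judgments))  # 70.0
```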

Created By

Yujie Lu et al. (WildVision Team)

Top Model Scores

Rank  Model            Score   Date
1     GPT-5.2          78.4%   2026-03
2     Gemini 3 Ultra   76.9%   2026-01
3     Claude Opus 4.6  75.2%   2026-02
4     Grok 4           71.6%   2026-02
5     Llama 4 405B     67.3%   2026-01