Reasoning | Est. 2023

MuSR

Multistep Soft Reasoning (MuSR) evaluates models on reasoning tasks that require chaining multiple inference steps, in three domains: murder mysteries, team allocation puzzles, and object placements. Each problem requires 2 to 7 reasoning steps.

Metrics

Accuracy (%) on multistep reasoning
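
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of problems answered correctly. It assumes each problem is scored by exact match between the model's chosen answer and a reference label; the scoring procedure is not described here, so treat that as an assumption. The function name `accuracy_percent` and the example labels are hypothetical.

```python
def accuracy_percent(predictions, answers):
    """Return the share of exact-match answers as a percentage.

    Assumes exact-match scoring, which is an illustrative assumption
    rather than the benchmark's documented procedure.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty evaluation set")
    # Count positions where the predicted label equals the reference label.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical multiple-choice labels, for illustration only.
example_predictions = ["A", "B", "B", "C"]
example_answers = ["A", "B", "C", "C"]
example_score = accuracy_percent(example_predictions, example_answers)
print(f"{example_score:.1f}%")  # prints 75.0%
```

Here three of the four hypothetical answers match, so the score is 75.0%. The scores in the table are reported on the same 0 to 100 scale.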

Created By

UT Austin

Top Model Scores

Rank  Model            Score  Date
1     GPT-5.2          71.8%  2026-03
2     Claude Opus 4.6  69.4%  2026-02
3     Gemini 3 Ultra   66.7%  2026-01
4     Grok 4           63.2%  2026-02
5     DeepSeek V3      59.8%  2026-01