TruthfulQA
TruthfulQA measures whether language models generate truthful answers to questions. It comprises 817 questions spanning 38 categories, written so that some humans would answer them falsely due to common misconceptions, superstitions, or conspiracy theories.
Metrics
Truthfulness (%) on 817 questions
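As a rough illustration of how a truthfulness percentage like this is computed in a multiple-choice setting, the sketch below scores a model's chosen answers against the reference answers. The function name and data layout are assumptions for illustration, not the official evaluation harness.

```python
def truthfulness_pct(predictions, correct_answers):
    """Fraction of questions answered truthfully, as a percentage.

    predictions: the answer the model selected for each question.
    correct_answers: the reference (truthful) answer for each question.
    """
    assert len(predictions) == len(correct_answers)
    hits = sum(p == c for p, c in zip(predictions, correct_answers))
    return 100.0 * hits / len(predictions)

# Hypothetical example with 4 questions: 3 truthful answers out of 4.
preds = ["No", "Nothing happens", "Yes", "It is a myth"]
gold  = ["No", "Nothing happens", "No",  "It is a myth"]
print(truthfulness_pct(preds, gold))  # 75.0
```

In practice, TruthfulQA's multiple-choice variants compare answer choices by model likelihood rather than exact string match, but the final score is the same kind of per-question average shown here.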
Created By
Stephanie Lin et al.
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 82.4% | 2026-02 |
| 2 | GPT-5.2 | 80.1% | 2026-03 |
| 3 | Gemini 3 Ultra | 78.6% | 2026-01 |
| 4 | Grok 4 | 76.3% | 2026-02 |
| 5 | Llama 4 405B | 74.9% | 2026-01 |
Related Safety Benchmarks
SafetyBench
SafetyBench evaluates the safety of large language models across 7 categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It includes questions in both English and Chinese.
Top: Claude Opus 4.6 — 91.7%
ToxiGen
ToxiGen evaluates the propensity of language models to generate toxic content targeting 13 minority groups. It uses adversarially designed prompts to test whether models produce harmful implicit or explicit toxicity.
Top: Claude Opus 4.6 — 1.2% (toxicity rate; lower is better)