SafetyBench
SafetyBench evaluates the safety of large language models using multiple-choice questions across 7 categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It includes questions in both English and Chinese.
Metrics
Safety score (%) across 7 safety categories
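Because SafetyBench questions are multiple-choice, the safety score reduces to accuracy: the percentage of questions answered correctly, reported per category. The sketch below illustrates this computation; the function name and the `category`/`predicted`/`answer` field names are assumptions for illustration, not the official evaluation harness.

```python
from collections import defaultdict

def safety_scores(results):
    """Compute per-category accuracy (%) plus an overall score.

    `results` is assumed to be a list of dicts with hypothetical
    'category', 'predicted', and 'answer' keys -- not the official schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["category"]] += 1
        correct[r["category"]] += r["predicted"] == r["answer"]
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_category, overall

# Example usage with toy data:
results = [
    {"category": "privacy", "predicted": "A", "answer": "A"},
    {"category": "privacy", "predicted": "B", "answer": "C"},
    {"category": "ethics and morality", "predicted": "D", "answer": "D"},
]
per_cat, overall = safety_scores(results)
print(per_cat)   # {'privacy': 50.0, 'ethics and morality': 100.0}
print(overall)   # 66.66...
```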
Created By
Tsinghua University
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 91.7% | 2026-02 |
| 2 | GPT-5.2 | 89.3% | 2026-03 |
| 3 | Gemini 3 Ultra | 87.8% | 2026-01 |
| 4 | Llama 4 405B | 85.2% | 2026-01 |
| 5 | Grok 4 | 83.6% | 2026-02 |
Related Safety Benchmarks
TruthfulQA
TruthfulQA measures whether language models generate truthful answers to questions where humans commonly hold misconceptions. It includes 817 questions spanning 38 categories, including health, law, finance, and conspiracy theories, targeting cases where models are incentivized to reproduce popular falsehoods rather than accurate but less common truths.
Top: Claude Opus 4.6 — 82.4%
ToxiGen
ToxiGen evaluates the propensity of language models to generate toxic content targeting 13 minority groups. It uses adversarially designed prompts to test whether models produce harmful implicit or explicit toxicity.
Top: Claude Opus 4.6 — 1.2% (lower is better)