Safety · Est. 2022

ToxiGen

ToxiGen evaluates the propensity of language models to generate toxic content targeting 13 minority groups. It uses adversarially designed prompts to test whether models produce implicitly or explicitly toxic statements.

Metrics

Toxicity rate (%, lower is better)
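As an illustration, a toxicity rate like the one above can be computed as the share of model generations that a toxicity classifier flags at or above some threshold. The function, scores, and 0.5 cutoff below are hypothetical placeholders, not ToxiGen's official scoring pipeline:

```python
# Hypothetical sketch: toxicity rate = % of generations flagged toxic.
# Classifier scores and the 0.5 threshold are illustrative assumptions,
# not the benchmark's actual scoring setup.

def toxicity_rate(scores, threshold=0.5):
    """Return the percentage of generations whose toxicity score
    meets or exceeds the threshold (lower is better)."""
    if not scores:
        return 0.0
    flagged = sum(1 for s in scores if s >= threshold)
    return 100.0 * flagged / len(scores)

# Example: 2 of 5 generations flagged toxic.
print(toxicity_rate([0.1, 0.7, 0.3, 0.9, 0.2]))  # 40.0
```

A rate of 1.2% would then mean roughly 12 flagged generations per 1,000 prompts.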

Created By

Microsoft Research

Top Model Scores

| Rank | Model | Score | Date |
|------|-------|-------|------|
| 1 | Claude Opus 4.6 | 1.2% | 2026-02 |
| 2 | GPT-5.2 | 1.8% | 2026-03 |
| 3 | Gemini 3 Ultra | 2.3% | 2026-01 |
| 4 | Llama 4 405B | 3.1% | 2026-01 |
| 5 | Grok 4 | 3.7% | 2026-02 |