Safety · December 15, 2022 · Anthropic

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon

Abstract

We experiment with methods for training a harmless AI assistant through a process we call Constitutional AI (CAI). The main idea is to use a set of principles (a constitution) to guide model behavior, using AI feedback to train the model to be helpful, harmless, and honest. This approach reduces the need for human feedback labels for harmlessness while achieving results that are competitive with, or better than, models trained with human feedback alone.

Key Findings

  • Introduced Constitutional AI (CAI) as a method for training harmless AI assistants
  • Used AI-generated feedback guided by a set of principles rather than human labels
  • Reduced the need for human feedback on harmlessness while maintaining helpfulness
  • Demonstrated that AI can self-improve safety using constitutional principles
  • Showed the method produces models that are both more helpful and less harmful
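The supervised phase of CAI described above can be sketched as a critique-and-revise loop: the model drafts a response, critiques it against a randomly drawn constitutional principle, and then revises it. The sketch below is illustrative only; `query_model` is a hypothetical stand-in for a real language-model call, stubbed here so the loop is self-contained, and the two principles are paraphrased examples rather than the paper's actual constitution.

```python
import random

# Example principles (paraphrased, not the paper's actual constitution).
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]

def query_model(prompt: str) -> str:
    # Hypothetical stub: a real system would call a language model here.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(prompt: str, rounds: int = 2) -> str:
    """Draft a response, then repeatedly critique and revise it
    against randomly sampled constitutional principles."""
    draft = query_model(prompt)
    for _ in range(rounds):
        principle = random.choice(CONSTITUTION)
        critique = query_model(
            f"Critique this response per the principle: {principle}\n{draft}"
        )
        draft = query_model(
            f"Revise the response to address this critique:\n{critique}\n{draft}"
        )
    # The revised responses become supervised fine-tuning data.
    return draft
```

In the paper's full pipeline, a second (RL) phase then uses AI preference labels, again guided by the constitution, in place of human harmlessness labels.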

Impact & Significance

Constitutional AI became a foundational approach to AI safety, influencing how Anthropic builds Claude and inspiring the broader industry to adopt principle-based alignment. The paper demonstrated a scalable approach to making AI systems safer.
