Safety · December 15, 2022 · Anthropic

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon

Abstract

We experiment with methods for training a harmless AI assistant through a process we call Constitutional AI (CAI). The main idea is to use a set of principles (a constitution) to guide model behavior, using AI feedback to train the model to be helpful, harmless, and honest. This approach reduces the need for human feedback labels for harmlessness while achieving results that are competitive with, or better than, models trained with human feedback alone.

Key Findings

  • Introduced Constitutional AI (CAI) as a method for training harmless AI assistants
  • Used AI-generated feedback guided by a set of principles rather than human labels
  • Reduced the need for human feedback on harmlessness while maintaining helpfulness
  • Demonstrated that AI can self-improve safety using constitutional principles
  • Showed the method produces models that are both more helpful and less harmful
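The supervised phase of CAI described above can be sketched as a critique-and-revise loop: the model drafts a response, critiques it against a randomly drawn constitutional principle, and then revises it. The sketch below is illustrative only; `query_model` is a hypothetical stand-in for a real language-model call, stubbed here so the loop is self-contained, and the two principles are paraphrased examples rather than the paper's actual constitution.

```python
import random

# Example principles (paraphrased, not the paper's actual constitution).
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]

def query_model(prompt: str) -> str:
    # Hypothetical stub: a real system would call a language model here.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(prompt: str, rounds: int = 2) -> str:
    """Draft a response, then repeatedly critique and revise it
    against randomly sampled constitutional principles."""
    draft = query_model(prompt)
    for _ in range(rounds):
        principle = random.choice(CONSTITUTION)
        critique = query_model(
            f"Critique this response per the principle: {principle}\n{draft}"
        )
        draft = query_model(
            f"Revise the response to address this critique:\n{critique}\n{draft}"
        )
    # The revised responses become supervised fine-tuning data.
    return draft
```

In the paper's full pipeline, a second (RL) phase then uses AI preference labels, again guided by the constitution, in place of human harmlessness labels.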

Impact & Significance

Constitutional AI became a foundational approach to AI safety, influencing how Anthropic builds Claude and inspiring the broader industry to adopt principle-based alignment. The paper demonstrated a scalable approach to making AI systems safer.
