Safety · March 4, 2022 · OpenAI

Training Language Models to Follow Instructions with Human Feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray

Abstract

Making language models bigger does not inherently make them better at following a user's intent. We show an avenue for aligning language models with user intent across a wide range of tasks by fine-tuning with reinforcement learning from human feedback (RLHF). Our resulting model, InstructGPT, produces outputs that humans prefer over GPT-3's, despite being 100x smaller in parameter count.

Key Findings

  • Demonstrated RLHF as an effective method for aligning LLMs with human preferences
  • Showed that a 1.3B-parameter InstructGPT model was preferred over the 175B-parameter GPT-3
  • Used a three-step process: supervised fine-tuning, reward model training, PPO optimization
  • Reduced toxic and untruthful outputs significantly compared to base models
  • Established the paradigm used to train ChatGPT and subsequent assistants
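The reward-model step above can be sketched as a pairwise comparison loss: given a human label saying one completion is better than another, the model is trained to assign the preferred completion a higher scalar reward. This is a minimal illustration of that objective, not the paper's code; the function name and scalar inputs are hypothetical.

```python
import math

def reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for reward-model training (step 2):
    -log(sigmoid(r_chosen - r_rejected)).

    The loss falls toward 0 as the reward model scores the
    human-preferred completion increasingly above the rejected one,
    and grows when it ranks them the wrong way around."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the full pipeline these scalars come from a learned reward head over candidate completions; the trained reward model then supplies the reward signal that PPO maximizes in step 3 (with a KL penalty keeping the policy close to the supervised fine-tuned model).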

Impact & Significance

This paper established RLHF as the standard technique for making LLMs useful and safe. The InstructGPT methodology directly led to ChatGPT and influenced virtually every AI assistant built since, making it one of the most impactful alignment papers.
