
What Is RLHF (Reinforcement Learning from Human Feedback)?

Definition

RLHF (Reinforcement Learning from Human Feedback) is an AI training technique where human evaluators rank model outputs by quality, and those rankings are used to train a reward model that guides the AI toward generating more helpful, accurate, and safe responses.

How RLHF (Reinforcement Learning from Human Feedback) Works

RLHF works in three stages: first, a language model is pre-trained on text data; second, human evaluators compare pairs of model outputs and indicate which is better; third, a reward model is trained on these preferences and used as the reward signal in a reinforcement learning loop, typically proximal policy optimization (PPO), that fine-tunes the original model. This process aligns the model's behavior with human values and expectations. RLHF was a key breakthrough behind ChatGPT's ability to follow instructions and refuse harmful requests, and it remains a foundational technique in AI safety and alignment research.
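To make the reward-model stage concrete, here is a minimal PyTorch sketch of how pairwise human preferences train a reward model with a Bradley-Terry style loss. Everything in it, the tiny RewardModel class, the random stand-in embeddings, and the hyperparameters, is a hypothetical illustration; a production reward model is a full language model with a scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in: a real reward model is a full language model
# with a scalar head; here a linear layer scores a fixed-size embedding.
class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per response

torch.manual_seed(0)
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy preference data: each row pairs the embedding of the response a
# human rater preferred ("chosen") with the one they rejected.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for step in range(200):
    reward_chosen = model(chosen)
    reward_rejected = model(rejected)
    # Bradley-Terry pairwise loss: maximize the log-probability that
    # the chosen response outscores the rejected one.
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.4f}")
```

Once trained, the reward model's scores stand in for direct human judgment inside the RL fine-tuning loop, which is what makes preference learning practical at scale.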

Real-World Examples

1. OpenAI using RLHF to train ChatGPT to follow user instructions rather than just predicting the next word

2. Anthropic applying RLHF to make Claude refuse harmful requests while remaining helpful for legitimate tasks

3. A team of human raters comparing two AI responses and selecting the more helpful one to train the reward model

RLHF (Reinforcement Learning from Human Feedback) on Vincony

Vincony's Compare Chat feature mirrors the RLHF evaluation process, letting users compare outputs from multiple models side by side to find the best response.
