Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Abstract
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting. GPT-3 achieves strong performance on many NLP datasets without any gradient updates or fine-tuning.
Key Findings
1. Demonstrated that 175B-parameter models exhibit strong few-shot learning abilities
2. Showed that in-context learning emerges at sufficient scale
3. Achieved competitive results without any gradient updates or fine-tuning
4. Revealed scaling laws: bigger models show qualitatively different capabilities
5. Introduced prompting as a new paradigm for using LLMs
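The few-shot setting above amounts to assembling a prompt from a task description, K demonstration pairs, and the query to complete; the model then predicts the continuation with no weight updates. A minimal sketch of such prompt construction (the `Input:`/`Output:` labels and helper name are illustrative, not the paper's exact format):

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a GPT-3-style few-shot prompt: a natural-language task
    description, K worked examples, and the query left for the model
    to complete. No gradient updates are involved; the examples are
    consumed purely as context at inference time."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")  # blank line between demonstrations
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)

# Example: English-to-French translation with K=2 demonstrations
# (a task format used in the paper; the pairs here are illustrative).
prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```

Varying K recovers the paper's zero-shot (K=0), one-shot (K=1), and few-shot (K in the tens) settings from the same template.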
Impact & Significance
GPT-3 launched the era of large language models and demonstrated that scale enables qualitatively new capabilities. It made AI accessible through APIs and natural language prompts, directly enabling the creation of ChatGPT and the AI application ecosystem.