AI Research Papers
Explore the most influential and landmark AI research papers — from the Transformer architecture to frontier LLMs, diffusion models, and AI safety breakthroughs.
Showing 49 of 49 papers
The Llama 3 Herd of Models
Meta AI
We present Llama 3, a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 across a range of tasks.
Qwen2 Technical Report
Alibaba Cloud
We introduce Qwen2, the next generation of the Qwen series of large language models. Qwen2 comprises dense language models of 0.5B, 1.5B, 7B, and 72B parameters, plus a 57B-A14B Mixture-of-Experts model, trained on data in 29 languages. Qwen2-72B achieves competitive performance with leading proprietary models on a wide range of benchmarks.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
We present DeepSeek-V2, a strong Mixture-of-Experts language model characterized by innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76x.
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic
We introduce the Claude 3 family of AI models: Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku. These models represent a significant leap in capabilities across reasoning, math, coding, multilingual understanding, and vision. Claude 3 Opus achieves near-human-level performance on expert knowledge benchmarks and sets new standards for AI safety and ethical behavior.
Video Generation Models as World Simulators
OpenAI
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We find that scaling video generation models is a promising path towards building general purpose simulators of the physical world. Our largest model, Sora, is capable of generating a minute of high fidelity video.
Phi-2: The Surprising Power of Small Language Models
Microsoft Research
We present Phi-2, a 2.7 billion parameter language model that demonstrates outstanding reasoning and language understanding capabilities, matching or outperforming models up to 25x larger. Phi-2 is trained on carefully curated synthetic and web data, showing that data quality can compensate for model size in achieving strong performance.
Gemini: A Family of Highly Capable Multimodal Models
Google DeepMind
We report on Gemini, a family of highly capable multimodal models that demonstrate strong generalist capabilities across image, audio, video, and text understanding. The Gemini Ultra model advances the state of the art on 30 of 32 benchmarks and is the first model to reach human-expert performance on the MMLU exam benchmark.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
We introduce Mamba, a new architecture for sequence modeling based on structured state space models (SSMs) with a selection mechanism. Mamba achieves performance comparable to Transformers while scaling linearly with sequence length instead of quadratically. On language modeling, Mamba matches or exceeds Transformers of the same size while being 5x faster at inference.
Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford + 6 more
We introduce Mistral 7B, a 7-billion parameter language model that outperforms the best open 13B model (Llama 2 13B) on all evaluated benchmarks and the best released 34B model (Llama 1 34B) on reasoning, math, and code generation. Mistral 7B uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) for handling longer sequences.
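The sliding window attention mentioned above can be illustrated with a small mask builder. This is a minimal sketch, not Mistral's implementation; the function name and list-of-lists layout are illustrative:

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention mask: position i may attend to
    position j only when i - window < j <= i, so each token sees at
    most `window` recent tokens (including itself)."""
    return [[i - window < j <= i for j in range(seq_len)]
            for i in range(seq_len)]
```

Stacking layers lets information propagate beyond a single window, which is how SWA handles sequences much longer than the window size.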
GPT-4V(ision) System Card
OpenAI
This system card describes GPT-4 with vision (GPT-4V), which enables users to instruct GPT-4 to analyze image inputs. We describe the safety evaluations, mitigations, and deployment preparation for the multimodal capabilities of GPT-4V including visual question answering, image description, spatial reasoning, and document understanding.
Improving Image Generation with Better Captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks + 6 more
We study how image generation models can be improved by training on better image captions. We develop an automatic captioning pipeline that generates highly descriptive image captions. Training text-to-image models on these improved captions substantially improves the quality and prompt-following ability of the resulting models, which we call DALL-E 3.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar + 4 more
We study how vision-language models trained on internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. We introduce RT-2, a class of vision-language-action (VLA) models that are trained on both web data and robotics data, and show that they can directly output robot actions.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert + 2 more
We develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform existing open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for some closed-source models.
Textbooks Are All You Need
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes + 17 more
We introduce phi-1, a 1.3 billion parameter Transformer model for code generation, trained on a combination of filtered web data and synthetically generated textbook-quality data. Despite its small size, phi-1 achieves pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP, substantially outperforming existing models of similar or even much larger size.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon + 2 more
While RLHF has been effective for aligning LLMs, it is complex and unstable. We introduce Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as RLHF but is simpler to implement and train. DPO eliminates the need for fitting a reward model, sampling from the LM, or performing RL optimization, while achieving comparable or superior performance.
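The DPO objective is compact enough to state directly. A minimal sketch of the per-pair loss, assuming total sequence log-probabilities are already computed; `beta` controls the strength of the implicit KL constraint:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Each argument is a total sequence log-probability. The implicit
    reward of a response is beta * (log pi - log pi_ref); the loss is
    -log sigmoid(reward_chosen - reward_rejected)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference on both responses, the margin is zero and the loss is log 2; widening the gap in favor of the chosen response drives the loss toward zero.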
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran + 3 more
We introduce Tree of Thoughts (ToT), a framework that generalizes over chain-of-thought prompting and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows language models to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action.
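The search scheme can be sketched as beam search over partial thought sequences. In this toy version, `expand` and `score` are hypothetical stand-ins for the LLM's proposal and evaluation calls:

```python
def tot_bfs(expand, score, beam=2, depth=2):
    """Tree-of-Thoughts as breadth-first search: at each level, propose
    candidate next thoughts for every kept state, score the partial
    solutions, and retain only the `beam` best."""
    frontier = [[]]          # each state is a list of thoughts so far
    for _ in range(depth):
        candidates = [s + [t] for s in frontier for t in expand(s)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return frontier[0]
```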
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
We present LLaVA (Large Language and Vision Assistant), the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on generated data, LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting behaviors of multimodal GPT-4 on unseen images and instructions.
Generative Agents: Interactive Simulacra of Human Behavior
Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris + 2 more
We introduce generative agents — computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, head to work, paint, write, form opinions, notice each other, and initiate conversations. We describe an architecture that extends a large language model to store a complete record of the agent's experiences, synthesize those memories into higher-level reflections, and retrieve them dynamically to plan behavior.
Segment Anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao + 8 more
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date, with over 1 billion masks on 11 million licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.
GPT-4 Technical Report
OpenAI
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
Alpaca: A Strong, Replicable Instruction-Following Model
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois + 4 more
We demonstrate that fine-tuning Meta's LLaMA 7B model on 52K instruction-following demonstrations generated with OpenAI's text-davinci-003 produces a model that behaves qualitatively similarly to text-davinci-003 itself. Alpaca costs less than $600 to reproduce, making it an accessible starting point for the research community to study instruction-following models.
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu + 4 more
We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu + 3 more
We introduce Self-Instruct, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates instruction, input, and output samples from a language model, then uses them to fine-tune the original model. Applying Self-Instruct to GPT-3 leads to a 33% absolute improvement over the original model on Super-NaturalInstructions.
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell + 6 more
We experiment with methods for training a harmless AI assistant through a process we call Constitutional AI (CAI). The main idea is to use a set of principles (a constitution) to guide model behavior, using AI feedback to train the model to be helpful, harmless, and honest. This approach reduces the need for human feedback labels for harmlessness while achieving competitive or superior results.
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman + 2 more
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are competitive with prior fully supervised results without the need for any fine-tuning.
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz + 5 more
We propose sparse upcycling, a simple approach to convert pre-trained dense models into Mixture-of-Experts (MoE) models. Starting from a dense checkpoint, we create expert copies and continue training with MoE routing. This approach outperforms both continued dense training and training MoE from scratch, while using existing pre-training investments.
Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
We present speculative decoding, an algorithm to accelerate inference from large autoregressive models without any changes to the model outputs. The key idea is to use a smaller, faster draft model to generate candidate tokens that are then verified in parallel by the larger target model. This provides up to 3x speedup while producing the exact same output distribution.
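The verification rule that makes this lossless is simple enough to sketch. A minimal, pure-Python version of one accept/resample step (the function name and optional `u` parameter are illustrative, not from the paper):

```python
import random

def accept_or_resample(draft_p, target_p, token, u=None):
    """One verification step of speculative decoding. draft_p/target_p
    are the two models' probability vectors at the same position and
    `token` is the draft model's sample. Accept with probability
    min(1, p_target / p_draft); on rejection, resample from the
    normalized residual max(target - draft, 0). This rule keeps the
    overall output distribution identical to the target model's."""
    if u is None:
        u = random.random()
    if u < min(1.0, target_p[token] / draft_p[token]):
        return True, token
    residual = [max(t - d, 0.0) for t, d in zip(target_p, draft_p)]
    r = random.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r < acc:
            return False, i
    return False, len(residual) - 1
```

The speedup comes from running this check in parallel over a whole block of draft tokens, accepting the longest verified prefix.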
Scaling Laws for Reward Model Overoptimization
Leo Gao, John Schulman, Jacob Hilton
In reinforcement learning from human feedback, it is common to optimize the policy against a learned reward model. We study how the gold reward score changes as we optimize against the proxy reward model. We find that this overoptimization can be characterized by scaling laws, and provide a theoretical framework for predicting when policies trained against proxy rewards will diverge from actual human preferences.
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du + 3 more
We propose ReAct, a general paradigm that synergizes reasoning and acting in large language models. ReAct prompts LLMs to generate both verbal reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with and gather additional information from external sources.
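The interleaving can be sketched as a small harness loop. Here `llm` is a hypothetical callable (prompt in, next Thought/Action text out) and `tools` maps action names to Python callables; the `Action: name[arg]` format follows the paper's prompting convention:

```python
def react_loop(llm, tools, question, max_steps=5):
    """Minimal ReAct harness: the model emits Thought/Action lines,
    the harness runs the action and appends an Observation, and the
    loop ends when the model emits a Final Answer."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(prompt)          # e.g. "Thought: ...\nAction: lookup[query]"
        prompt += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        name, _, arg = step.split("Action: ")[1].partition("[")
        observation = tools[name](arg.rstrip("]"))
        prompt += f"Observation: {observation}\n"
    return None
```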
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra + 1 more
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention is 2-4x faster than standard attention and enables up to 16x longer context lengths.
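FlashAttention's contribution is the IO-aware GPU kernel, but the identity that lets it process keys in tiles is the online softmax, which can be sketched in plain Python for a single query (the blocking here only mimics the tiling; there is no real memory hierarchy in this sketch):

```python
import math

def streaming_attention(q, keys, values, block=2):
    """Attention for one query with keys/values consumed in blocks.
    Only a running max, running normalizer, and running weighted sum
    are kept, so the full score vector is never materialized."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")   # running max of scores (for numerical stability)
    z = 0.0             # running softmax normalizer
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        for k, v in zip(keys[start:start + block], values[start:start + block]):
            s = scale * sum(qi * ki for qi, ki in zip(q, k))
            m_new = max(m, s)
            corr = math.exp(m - m_new) if m != float("-inf") else 0.0
            w = math.exp(s - m_new)
            z = z * corr + w
            acc = [a * corr + w * vi for a, vi in zip(acc, v)]
            m = m_new
    return [a / z for a in acc]
```

The rescaling by `corr` whenever the running max changes is what makes the blockwise result exactly equal to standard attention.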
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li + 9 more
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. We discover that generic large language models, pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis. Imagen achieves a new state-of-the-art FID score on the COCO benchmark.
A Generalist Agent
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo + 16 more
We introduce Gato, a single agent that works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu + 1 more
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. We propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We call the resulting model DALL-E 2.
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma + 6 more
We trained a 540-billion parameter, dense decoder-only Transformer model, which we call Pathways Language Model (PaLM). PaLM achieves state-of-the-art few-shot learning results on hundreds of language understanding and generation tasks. It demonstrates breakthrough capabilities on reasoning tasks requiring multi-step logical inference.
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya + 6 more
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained. We train a compute-optimal model, Chinchilla (70B parameters, 1.4T tokens), that uses the same compute as Gopher (280B) but outperforms it on nearly every benchmark.
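The sizing rule can be sketched from two commonly cited approximations: training compute C ≈ 6·N·D FLOPs, and the Chinchilla finding that tokens should scale roughly 20x the parameter count (both assumptions, stated here as a rule of thumb rather than the paper's full fitted laws):

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal sizing: with C = 6 * N * D and
    D = tokens_per_param * N, solve N = sqrt(C / (6 * tokens_per_param))
    and D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params
```

Plugging in Chinchilla's own budget recovers its configuration: 70B parameters and 1.4T tokens.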
Training Language Models to Follow Instructions with Human Feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida + 6 more
Making language models bigger does not inherently make them better at following a user's intent. We show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback (RLHF). Our resulting model, InstructGPT, produces outputs that humans prefer to GPT-3's, even with 100x fewer parameters.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma + 5 more
We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning. We show that chain-of-thought prompting substantially outperforms standard prompting on arithmetic, commonsense, and symbolic reasoning benchmarks, with improvements most dramatic in the largest models.
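The technique is purely a prompting change, so a sketch is just a prompt template. The exemplar below is adapted from the paper's canonical tennis-ball example; the helper name is illustrative:

```python
FEW_SHOT_COT = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. \
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. \
The answer is 11.

Q: {question}
A:"""

def cot_prompt(question):
    """Chain-of-thought prompting: because the exemplar spells out its
    intermediate steps, the model is nudged to reason step by step
    before stating the final answer."""
    return FEW_SHOT_COT.format(question=question)
```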
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin + 1 more
We show that neural networks can learn to generalize on algorithmic tasks long after memorizing the training data, a phenomenon we call grokking. In some cases, networks achieve perfect generalization thousands of training steps after reaching perfect training accuracy. This challenges conventional wisdom about the relationship between memorization and generalization.
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser + 1 more
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results. We apply diffusion models in the latent space of powerful pretrained autoencoders, achieving a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.
LoRA: Low-Rank Adaptation of Large Language Models
Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu + 4 more
We propose Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. LoRA reduces the number of trainable parameters by 10,000x and the GPU memory requirement by 3x compared to full fine-tuning.
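The forward pass is small enough to sketch in pure Python. Note the paper scales the update by alpha/r; a single `scale` factor stands in here, and the tiny list-of-rows matrices are purely illustrative:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """LoRA sketch: the frozen weight W is augmented with a trainable
    low-rank update, so y = W x + scale * B (A x). A is r x d_in and
    B is d_out x r; only A and B receive gradients during fine-tuning."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]
```

Because the update is a product of two thin matrices, the trainable parameter count is r * (d_in + d_out) instead of d_in * d_out, and at deployment B·A can be merged back into W with zero inference overhead.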
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh + 8 more
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million image-text pairs. CLIP models learn to connect images and text in a shared embedding space, enabling zero-shot transfer to downstream tasks.
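The training objective pairs each image with its caption via a symmetric cross-entropy over cosine similarities. A minimal sketch with plain Python lists (real CLIP uses a learned temperature and large batches):

```python
import math

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: L2-normalize both sets of embeddings,
    scale pairwise cosine similarities by 1/temperature, and require the
    i-th image to match the i-th text (and vice versa)."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    imgs = [norm(v) for v in image_emb]
    txts = [norm(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]

    def xent(rows):
        # mean of -log softmax probability at the matching (diagonal) index
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            lse = m + math.log(sum(math.exp(r - m) for r in row))
            total += lse - row[k]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]        # text-to-image direction
    return 0.5 * (xent(logits) + xent(cols))
```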
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, Noam Shazeer
We introduce Switch Transformers, which simplify the Mixture of Experts (MoE) routing algorithm to route to a single expert, reducing computation and communication costs. Switch Transformers scale to trillion parameter models with the same computational cost as much smaller dense models, achieving up to 7x speedups in pre-training.
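The top-1 routing decision can be sketched for a single token. This omits the load-balancing auxiliary loss and capacity limits that production Switch layers need; `experts` here is just a list of callables:

```python
import math

def switch_route(router_logits, experts, x):
    """Switch-style top-1 routing: softmax the router logits, dispatch
    the token to the single highest-probability expert, and scale that
    expert's output by its gate probability so the router stays
    differentiable."""
    m = max(router_logits)
    probs = [math.exp(l - m) for l in router_logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    k = max(range(len(probs)), key=probs.__getitem__)
    return [probs[k] * y for y in experts[k](x)]
```

Routing to one expert instead of the usual top-2 is what cuts the computation and communication cost while leaving per-token FLOPs constant as the expert count grows.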
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn + 8 more
While the Transformer architecture has become the de-facto standard for NLP tasks, its applications to computer vision remain limited. We show that a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs while requiring substantially fewer computational resources to train.
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, Pieter Abbeel
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our models produce samples that are competitive with state-of-the-art GANs while enjoying desirable properties such as distribution coverage and a stationary training objective.
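The forward (noising) process has a closed form that makes training simple: any step can be sampled directly from the clean data. A sketch over plain Python vectors:

```python
import math
import random

def q_sample(x0, t, betas, eps=None):
    """Forward diffusion: x_t = sqrt(alpha_bar_t) * x0
    + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the cumulative
    product of (1 - beta_s) for s <= t. The reverse-process network is
    trained to predict the injected noise `eps` from x_t and t."""
    alpha_bar = 1.0
    for b in betas[:t + 1]:
        alpha_bar *= (1.0 - b)
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in x0]
    c1, c2 = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [c1 * x + c2 * e for x, e in zip(x0, eps)]
```

As t grows and alpha_bar shrinks toward zero, x_t approaches pure Gaussian noise, which is the starting point for sampling.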
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah + 6 more
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting. GPT-3 achieves strong performance on many NLP datasets without any gradient updates or fine-tuning.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni + 8 more
Large pre-trained language models have been shown to store factual knowledge in their parameters. However, their ability to access and precisely manipulate knowledge is still limited. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown + 5 more
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training. Larger models are significantly more sample-efficient, so compute-optimal training favors very large models stopped significantly short of convergence.
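The model-size law is a one-liner. A sketch using the paper's reported fit for non-embedding parameters (N_c ≈ 8.8e13, alpha_N ≈ 0.076); the data- and compute-limited regimes have their own analogous fits:

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Kaplan-style scaling law for model size: L(N) = (N_c / N) ** alpha,
    the predicted cross-entropy loss when data and compute are not the
    bottleneck."""
    return (n_c / n_params) ** alpha
```

The shallow exponent means each 10x increase in parameters buys a modest, but reliably predictable, drop in loss.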
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit + 4 more
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
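The core operation is scaled dot-product attention, softmax(Q Kᵀ / √d_k) V, sketched here for small Python lists of row vectors (a real implementation batches this over heads and uses matrix libraries):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors: each query
    scores every key, the scores become softmax weights, and the output
    is the weighted sum of value vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        m = max(scores)                      # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

The 1/√d_k scaling keeps the dot products from saturating the softmax as the key dimension grows, which is what makes training stable at large widths.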