
AI Research Papers

Explore the most influential and landmark AI research papers — from the Transformer architecture to frontier LLMs, diffusion models, and AI safety breakthroughs.

LLM · July 23, 2024 · Meta AI

The Llama 3 Herd of Models

Meta AI

We present Llama 3, a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 across a range of tasks.

LLM · July 15, 2024 · Alibaba Cloud / Qwen Team

Qwen2 Technical Report

Alibaba Cloud

We introduce Qwen2, the next generation of the Qwen series of large language models. Qwen2 includes dense language models of 0.5B, 1.5B, 7B, 57B-A14B (MoE), and 72B parameters, trained on data in 29 languages. Qwen2-72B achieves competitive performance with leading proprietary models on a wide range of benchmarks.

Efficiency · May 7, 2024 · DeepSeek AI

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek AI

We present DeepSeek-V2, a strong Mixture-of-Experts language model characterized by innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76x.

LLM · March 4, 2024 · Anthropic

The Claude 3 Model Family: Opus, Sonnet, and Haiku

Anthropic

We introduce the Claude 3 family of AI models: Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku. These models represent a significant leap in capabilities across reasoning, math, coding, multilingual understanding, and vision. Claude 3 Opus achieves near-human-level performance on expert knowledge benchmarks and sets new standards for AI safety and ethical behavior.

Vision · February 15, 2024 · OpenAI

Video Generation Models as World Simulators

OpenAI

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We find that scaling video generation models is a promising path towards building general purpose simulators of the physical world. Our largest model, Sora, is capable of generating a minute of high fidelity video.

Efficiency · December 12, 2023 · Microsoft Research

Phi-2: The Surprising Power of Small Language Models

Microsoft Research

We present Phi-2, a 2.7 billion parameter language model that demonstrates outstanding reasoning and language understanding capabilities, matching or outperforming models up to 25x larger. Phi-2 is trained on carefully curated synthetic and web data, showing that data quality can compensate for model size in achieving strong performance.

Multimodal · December 6, 2023 · Google DeepMind

Gemini: A Family of Highly Capable Multimodal Models

Google DeepMind

We report on Gemini, a family of highly capable multimodal models that demonstrate strong generalist capabilities across image, audio, video, and text understanding. The Gemini Ultra model advances the state of the art on 30 of 32 benchmarks and is the first model to reach human-expert performance on the MMLU exam benchmark.

Efficiency · December 1, 2023 · Carnegie Mellon / Princeton

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu, Tri Dao

We introduce Mamba, a new architecture for sequence modeling based on structured state space models (SSMs) with a selection mechanism. Mamba achieves performance comparable to Transformers while scaling linearly with sequence length instead of quadratically. On language modeling, Mamba matches or exceeds Transformers of the same size while being 5x faster at inference.
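To make the linear-time claim concrete, here is a toy one-dimensional sketch of the recurrence behind state space models; the input-dependent gate is a crude stand-in for Mamba's selection mechanism (real Mamba makes the SSM parameters functions of the input via learned projections), and all values here are illustrative:

```python
def selective_ssm(xs, a=0.9):
    """Toy linear-time SSM recurrence: h_t = a*h_{t-1} + b_t*x_t, y_t = h_t.
    Mamba's 'selection' makes b_t (and the step size) a function of the
    input x_t rather than a fixed constant; here b_t crudely gates out
    small inputs so the state can selectively ignore tokens."""
    h, ys = 0.0, []
    for x in xs:
        b_t = 1.0 if abs(x) > 0.5 else 0.0  # input-dependent gate
        h = a * h + b_t * x
        ys.append(h)
    return ys
```

Each step touches only the current state, so the cost is linear in sequence length, unlike the quadratic pairwise scores of self-attention.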

Efficiency · October 10, 2023 · Mistral AI

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford + 6 more

We introduce Mistral 7B, a 7-billion parameter language model that outperforms the best open 13B model (Llama 2 13B) on all evaluated benchmarks and the best released 34B model (Llama 1 34B) on reasoning, math, and code generation. Mistral 7B uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) for handling longer sequences.

Multimodal · September 25, 2023 · OpenAI

GPT-4V(ision) System Card

OpenAI

This system card describes GPT-4 with vision (GPT-4V), which enables users to instruct GPT-4 to analyze image inputs. We describe the safety evaluations, mitigations, and deployment preparation for the multimodal capabilities of GPT-4V including visual question answering, image description, spatial reasoning, and document understanding.

Vision · September 20, 2023 · OpenAI

Improving Image Generation with Better Captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks + 6 more

We study how image generation models can be improved by training on better image captions. We develop an automatic captioning pipeline that generates highly descriptive image captions. Training text-to-image models on these improved captions substantially improves the quality and prompt-following ability of the resulting models, which we call DALL-E 3.

Robotics · July 28, 2023 · Google DeepMind

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar + 4 more

We study how vision-language models trained on internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. We introduce RT-2, a class of vision-language-action (VLA) models that are trained on both web data and robotics data, and show that they can directly output robot actions.

LLM · July 18, 2023 · Meta AI

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert + 2 more

We develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform existing open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for some closed-source models.

Efficiency · June 20, 2023 · Microsoft Research

Textbooks Are All You Need

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes + 17 more

We introduce phi-1, a 1.3 billion parameter Transformer model for code generation, trained on a combination of filtered web data and synthetically generated textbook-quality data. Despite its small size, phi-1 achieves pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP, substantially outperforming existing models of similar or even much larger size.

Safety · May 29, 2023 · Stanford University

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon + 2 more

While RLHF has been effective for aligning LLMs, it is complex and unstable. We introduce Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as RLHF but is simpler to implement and train. DPO eliminates the need for fitting a reward model, sampling from the LM, or performing RL optimization, while achieving comparable or superior performance.
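To see why DPO needs no reward model, here is a minimal sketch of the per-pair objective in plain Python; the function name and scalar log-probability inputs are illustrative, not the paper's code. The loss is the negative log-sigmoid of the gap between the policy's chosen-vs-rejected log-ratio margin and the frozen reference model's margin:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy and under the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy already prefers the chosen response more than the
# reference does, the loss falls below -log(0.5).
low = dpo_loss(-10.0, -20.0, -15.0, -15.0)   # policy margin > reference margin
high = dpo_loss(-20.0, -10.0, -15.0, -15.0)  # policy margin < reference margin
```

Minimizing this with gradient descent over preference pairs plays the role that fitting a reward model and running RL would play in RLHF.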

LLM · May 17, 2023 · Princeton University / Google DeepMind

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran + 3 more

We introduce Tree of Thoughts (ToT), a framework that generalizes over chain-of-thought prompting and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows language models to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action.

Multimodal · April 17, 2023 · University of Wisconsin / Microsoft Research

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

We present LLaVA (Large Language and Vision Assistant), the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on generated data, LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting behaviors of multimodal GPT-4 on unseen images and instructions.

Agents · April 7, 2023 · Stanford University / Google DeepMind

Generative Agents: Interactive Simulacra of Human Behavior

Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris + 2 more

We introduce generative agents — computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, head to work, paint, write, form opinions, notice each other, and initiate conversations. We describe an architecture that extends a large language model to store a complete record of the agent's experiences, synthesize those memories into higher-level reflections, and retrieve them dynamically to plan behavior.

Vision · April 5, 2023 · Meta AI

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao + 8 more

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset ever, with over 1 billion masks on 11 million licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.

LLM · March 15, 2023 · OpenAI

GPT-4 Technical Report

OpenAI

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.

LLM · March 13, 2023 · Stanford University

Alpaca: A Strong, Replicable Instruction-Following Model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois + 4 more

We demonstrate that fine-tuning Meta's LLaMA 7B model on 52K instruction-following demonstrations generated in the style of Self-Instruct using text-davinci-003 produces a model that behaves qualitatively similarly to text-davinci-003. Alpaca costs less than $600 to reproduce, making it an accessible starting point for the research community to study instruction-following models.

Agents · February 9, 2023 · Meta AI

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu + 4 more

We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

LLM · December 20, 2022 · University of Washington / Allen AI

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu + 3 more

We introduce Self-Instruct, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates instructions, input, and output samples from a language model, then uses them to fine-tune the original model. Applying self-instruct to GPT-3 leads to a 33% absolute improvement over the original model on SuperNatural Instructions.

Safety · December 15, 2022 · Anthropic

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell + 6 more

We experiment with methods for training a harmless AI assistant through a process we call Constitutional AI (CAI). The main idea is to use a set of principles (a constitution) to guide model behavior, using AI feedback to train the model to be helpful, harmless, and honest. This approach reduces the need for human feedback labels for harmlessness while achieving competitive or superior results.

Audio · December 6, 2022 · OpenAI

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman + 2 more

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are competitive with prior fully supervised results without the need for any fine-tuning.

Efficiency · December 5, 2022 · Google Research

Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz + 5 more

We propose sparse upcycling, a simple approach to convert pre-trained dense models into Mixture-of-Experts (MoE) models. Starting from a dense checkpoint, we create expert copies and continue training with MoE routing. This approach outperforms both continued dense training and training MoE from scratch, while reusing the sunk cost of the dense pre-training run.

Efficiency · November 30, 2022 · Google Research

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, Yossi Matias

We present speculative decoding, an algorithm to accelerate inference from large autoregressive models without any changes to the model outputs. The key idea is to use a smaller, faster draft model to generate candidate tokens that are then verified in parallel by the larger target model. This provides up to 3x speedup while producing the exact same output distribution.
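The draft-then-verify loop is easy to state in code. Below is a simplified greedy-decoding variant (the paper's general algorithm uses rejection sampling to preserve the full sampling distribution); the toy `target_next`/`draft_next` callables are hypothetical stand-ins for real models:

```python
def greedy_speculative_step(target_next, draft_next, prefix, k=4):
    """One step of speculative decoding, greedy variant.

    `draft_next` / `target_next` map a token sequence to the next token.
    The draft proposes k tokens; the target checks them (in practice in one
    parallel forward pass, simulated here by repeated calls) and keeps the
    longest agreeing prefix plus one token of its own, so the output always
    matches what the target alone would have produced.
    """
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # target's own token after the first mismatch
    return accepted

# Toy models: the target counts upward; the draft agrees except on every 3rd token.
target = lambda seq: len(seq)
draft = lambda seq: len(seq) if len(seq) % 3 else len(seq) + 1
```

The speedup comes from the target model scoring all k draft tokens in one pass instead of k sequential passes; whenever the cheap draft agrees, several tokens are emitted per target-model invocation.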

Safety · October 19, 2022 · OpenAI

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, Jacob Hilton

In reinforcement learning from human feedback, it is common to optimize the policy against a learned reward model. We study how the gold reward score changes as we optimize against the proxy reward model. We find that this overoptimization can be characterized by scaling laws, and provide a theoretical framework for predicting when policies trained against proxy rewards will diverge from actual human preferences.

Agents · October 6, 2022 · Princeton University / Google Brain

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du + 3 more

We propose ReAct, a general paradigm that synergizes reasoning and acting in large language models. ReAct prompts LLMs to generate both verbal reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with and gather additional information from external sources.
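The interleaving can be sketched as a small driver loop. The `llm` and `tools` interfaces below are hypothetical (a real implementation would call a model API and parse its output more robustly); the Thought/Action/Observation line format follows the paper's prompting convention:

```python
def react_loop(llm, tools, question, max_steps=5):
    """Minimal ReAct driver: alternate Thought/Action lines produced by the
    model with Observation lines produced by tools, until the model emits
    a final Finish[...] answer. `llm` maps a transcript to its next step;
    `tools` maps tool names to callables."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # e.g. "Thought: ...\nAction: search[query]"
        transcript += step + "\n"
        if step.startswith("Finish["):
            return step[len("Finish["):-1]
        if "Action: " in step:
            name, _, arg = step.split("Action: ")[1].partition("[")
            observation = tools[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return None
```

Because each Observation is appended to the transcript, the model's next Thought can condition on what the tool actually returned, which is the synergy the paper describes.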

Efficiency · May 27, 2022 · Stanford University

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra + 1 more

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention is 2-4x faster than standard attention and enables up to 16x longer context lengths.
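Tiling works because softmax can be computed in a single streaming pass. Here is a one-dimensional sketch of that online-softmax trick (the kernel applies the same rescaling per tile of the attention matrix; this is an illustration of the idea, not the CUDA kernel):

```python
import math

def streaming_softmax_weighted_sum(scores, values):
    """Online softmax with O(1) extra memory, the core trick behind FlashAttention's tiling.

    Maintains a running max `m`, normalizer `l`, and partial output `out`,
    rescaling the partials whenever a new max appears, so the full score
    vector never needs to be materialized.
    """
    m, l, out = float("-inf"), 0.0, 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * scale + math.exp(s - m_new)
        out = out * scale + math.exp(s - m_new) * v
        m = m_new
    return out / l
```

Processing attention scores tile by tile this way keeps the working set in on-chip SRAM, which is where the memory-bandwidth savings come from.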

Multimodal · May 23, 2022 · Google Research

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li + 9 more

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. We discover that generic large language models, pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis. Imagen achieves a new state-of-the-art FID score on the COCO benchmark.

Agents · May 12, 2022 · Google DeepMind

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo + 16 more

We introduce Gato, a single generalist agent that works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

Multimodal · April 13, 2022 · OpenAI

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu + 1 more

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. We propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We call the resulting model DALL-E 2.

LLM · April 5, 2022 · Google Research

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma + 6 more

We trained a 540-billion parameter, dense decoder-only Transformer model, which we call Pathways Language Model (PaLM). PaLM achieves state-of-the-art few-shot learning results on hundreds of language understanding and generation tasks. It demonstrates breakthrough capabilities on reasoning tasks requiring multi-step logical inference.

Efficiency · March 29, 2022 · DeepMind

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya + 6 more

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained. We train a compute-optimal model, Chinchilla (70B parameters, 1.4T tokens), that uses the same compute as Gopher (280B) but outperforms it on nearly every benchmark.
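The Chinchilla result is often summarized by two rules of thumb: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal point has D ≈ 20·N. A quick sanity-check sketch under those assumptions:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal model/data split under C = 6*N*D with D = 20*N.

    Both constants are commonly cited approximations of the paper's fits,
    not exact values; solving 6*N*(20*N) = C gives N = sqrt(C / 120).
    """
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

# Chinchilla itself: 70B parameters on 1.4T tokens uses C = 6*N*D ≈ 5.9e23 FLOPs,
# and plugging that budget back in recovers the same (N, D).
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
```

Under a fixed budget, the takeaway is that N and D should grow together, which is why Gopher-sized models trained on far fewer than 20 tokens per parameter were "significantly undertrained".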

Safety · March 4, 2022 · OpenAI

Training Language Models to Follow Instructions with Human Feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida + 6 more

Making language models bigger does not inherently make them better at following a user's intent. We show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback (RLHF). Our resulting model, InstructGPT, produces outputs preferred by humans over GPT-3 despite being 100x smaller in parameter count.

LLM · January 28, 2022 · Google Brain

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma + 5 more

We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning. We show that chain-of-thought prompting substantially outperforms standard prompting on arithmetic, commonsense, and symbolic reasoning benchmarks, with improvements most dramatic in the largest models.
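Chain-of-thought prompting is purely a prompt-format change: each few-shot exemplar shows worked reasoning before the answer. A minimal template (the exemplar is the paper's well-known tennis-ball example; the placeholder question is illustrative):

```python
# Few-shot chain-of-thought prompt: the exemplar demonstrates intermediate
# reasoning steps, which the model then imitates for the new question.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

prompt = COT_PROMPT.format(question="A juggler has 16 balls and drops half of them. How many are left?")
```

Standard few-shot prompting would show only "The answer is 11" without the intermediate arithmetic; the paper's finding is that including those steps is what unlocks the gains, and only at sufficient model scale.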

Other · January 6, 2022 · OpenAI

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin + 1 more

We show that neural networks can learn to generalize on algorithmic tasks long after memorizing the training data, a phenomenon we call grokking. In some cases, networks achieve perfect generalization thousands of training steps after reaching perfect training accuracy. This challenges conventional wisdom about the relationship between memorization and generalization.

Vision · December 20, 2021 · LMU Munich / Stability AI

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser + 1 more

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results. We apply diffusion models in the latent space of powerful pretrained autoencoders, achieving a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.

Efficiency · June 17, 2021 · Microsoft

LoRA: Low-Rank Adaptation of Large Language Models

Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu + 4 more

We propose Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. LoRA reduces the number of trainable parameters by 10,000x and the GPU memory requirement by 3x compared to full fine-tuning.
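The parameter saving follows directly from the shapes: a frozen weight W of size d_out × d_in is adapted as W + B·A, where B is d_out × r and A is r × d_in with r much smaller than the dimensions. A quick arithmetic sketch (the 4096-wide projection and r = 8 are illustrative values):

```python
def lora_param_counts(d_in, d_out, r):
    """Trainable-parameter comparison for one weight matrix.

    Full fine-tuning updates all d_out*d_in entries of W; LoRA freezes W
    and trains only B (d_out x r) and A (r x d_in).
    """
    full = d_out * d_in
    lora = r * (d_in + d_out)
    return full, lora

# e.g. one 4096x4096 attention projection with rank r = 8:
full, lora = lora_param_counts(4096, 4096, 8)
```

For this single matrix the reduction factor is 256x; applied across only the attention projections of a very large model, and with optimizer state counted, the paper reports reductions up to 10,000x in trainable parameters.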

Multimodal · February 26, 2021 · OpenAI

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh + 8 more

We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million image-text pairs. CLIP models learn to connect images and text in a shared embedding space, enabling zero-shot transfer to downstream tasks.

Efficiency · January 11, 2021 · Google Brain

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, Noam Shazeer

We introduce Switch Transformers, which simplify the Mixture of Experts (MoE) routing algorithm to route to a single expert, reducing computation and communication costs. Switch Transformers scale to trillion parameter models with the same computational cost as much smaller dense models, achieving up to 7x speedups in pre-training.
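Routing to a single expert is a small piece of code. Here is a sketch for one token (real routers operate on batches, add load-balancing losses, and cap expert capacity; the function name and 2-expert example are illustrative):

```python
import math

def switch_route(x, router_weights):
    """Top-1 (Switch) routing for one token vector x.

    Computes a softmax over per-expert router logits, then sends the token
    to the single argmax expert; the gate probability scales that expert's
    output so the router still receives gradients.
    """
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in router_weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    expert = probs.index(max(probs))
    return expert, probs[expert]
```

Because each token activates only one expert's FFN, total parameters can grow with the number of experts while per-token compute stays roughly that of the dense model.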

Vision · October 22, 2020 · Google Research

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn + 8 more

While the Transformer architecture has become the de-facto standard for NLP tasks, its applications to computer vision remain limited. We show that a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs while requiring substantially fewer computational resources to train.

Vision · June 19, 2020 · UC Berkeley

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our models produce samples that are competitive with state-of-the-art GANs while enjoying desirable properties such as distribution coverage and a stationary training objective.
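The forward (noising) process has a convenient closed form: x_t can be sampled directly from x_0 as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε ~ N(0, 1), where ᾱ_t is the cumulative product of (1 − β_s). A scalar sketch (a constant β schedule is used here purely for illustration):

```python
import math, random

def q_sample(x0, t, betas, rng=random):
    """Sample x_t from the DDPM forward process in closed form:
    x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps, eps ~ N(0, 1),
    where abar_t is the cumulative product of (1 - beta_s) for s <= t."""
    abar = 1.0
    for beta in betas[: t + 1]:
        abar *= 1.0 - beta
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(abar) * x0 + math.sqrt(1.0 - abar) * eps, abar
```

Training then amounts to teaching a network to predict ε from (x_t, t); generation reverses the process step by step from pure noise.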

LLM · May 28, 2020 · OpenAI

Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah + 6 more

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting. GPT-3 achieves strong performance on many NLP datasets without any gradient updates or fine-tuning.

LLM · May 22, 2020 · Meta AI / UCL / NYU

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni + 8 more

Large pre-trained language models have been shown to store factual knowledge in their parameters. However, their ability to access and precisely manipulate knowledge is still limited. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.

LLM · January 23, 2020 · OpenAI / Johns Hopkins

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown + 5 more

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
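The parameter-count law, for instance, has the form L(N) = (N_c / N)^α. A one-line sketch using the paper's fitted constants (α_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters; both depend on the tokenization and setup):

```python
def lm_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Kaplan-et-al. power law for test loss vs non-embedding parameter count:
    L(N) = (N_c / N) ** alpha, holding data and compute unconstrained.
    Constants are the paper's fitted values and are setup-dependent."""
    return (n_c / n_params) ** alpha
```

Analogous power laws hold for dataset size D and compute C, and the small exponents mean each constant factor of improvement in loss requires a large multiplicative increase in scale.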

LLM · October 11, 2018 · Google AI Language

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

LLM · June 12, 2017 · Google Brain / University of Toronto

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit + 4 more

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
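The core operation of the Transformer is scaled dot-product attention, softmax(Q·Kᵀ/√d_k)·V. A minimal single-head reference in pure Python (a real implementation is batched and vectorized; lists of floats stand in for tensors here):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats). For each query, scores
    against all keys are scaled by sqrt(d_k), softmax-normalized, and used
    to take a weighted average of the value vectors.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

Every query attends to every key, which is what removes the sequential dependency of recurrence and makes training parallelizable, at the cost of quadratic complexity in sequence length.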
