Together Research

Foundational research for production AI

Our research areas

  • Inference

    Design and optimization of production inference systems, spanning scheduling, batching, and hardware–software co-design for reliable high throughput; a minimal batching sketch follows this list.

    Read papers
  • Kernels

    Development of high-performance GPU kernels for training and inference, optimizing memory, attention, and custom operators at production scale.

    Read papers
  • Model Shaping

    Advancement of post-training methods like fine-tuning, distillation, and quantization to shape efficient, controllable model behavior.

    Read papers
  • Agents

    Studies of long-horizon reasoning and decision-making, focusing on tool use, multi-step planning, and reinforcement learning for reliable agentic systems.

    Read papers
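
To make the inference item above concrete: continuous batching admits new requests into an in-flight batch the moment finished requests leave, rather than draining a whole static batch first. The sketch below is a minimal, framework-free illustration under assumed interfaces (`Request`, `step`, and `serve` are hypothetical names, not Together's serving engine):

```python
# Minimal continuous-batching scheduler (illustrative sketch only).
# Real serving engines also manage KV-cache memory, preemption,
# prefill/decode phases, and GPU streams; none of that is modeled here.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:                      # hypothetical request record
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def step(batch):
    """Decode one token for every in-flight request (model call stubbed)."""
    for req in batch:
        req.generated.append("<tok>")  # stand-in for a model forward pass

def serve(requests, max_batch_size=8):
    waiting, running, done = deque(requests), [], []
    while waiting or running:
        # Admit new requests as soon as slots free up, instead of
        # waiting for the whole batch to finish (static batching).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step(running)
        still_running = []
        for req in running:
            bucket = done if len(req.generated) >= req.max_new_tokens else still_running
            bucket.append(req)
        running = still_running
    return done
```

The point of the structure is that `step` runs on whatever is currently in flight, so short requests exit and free their slot immediately instead of stalling behind the longest request in their batch.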

Recognized research

Papers accepted at top conferences

Spotlight · ICLR

ThunderKittens: Simple, Fast, and Adorable AI Kernels

Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré

Outstanding Paper · COLM

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu, Tri Dao

Best Paper · ICML HAET

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

MLSys

CDLM: Consistency Diffusion Language Models for Faster Sampling

Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami

MLSys

Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training

Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang

MLSys

Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Leon Song

MLSys

ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Simran Arora

ICLR

FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

Max Ryabinin

ICLR

When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework

Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, Ce Zhang

ICLR

Speculative Speculative Decoding

Tanishq Kumar, Tri Dao, Avner May

Spotlight · NeurIPS Datasets and Benchmarks

RedPajama: an Open Dataset for Training Large Language Models

Maurice Weber, Daniel Y. Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, Ce Zhang

ICML ME-FoMo

Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Jon Saad-Falcon, Daniel Y. Fu, Simran Arora, Neel Guha, Christopher Ré

Spotlight · ICML

Simple linear attention language models balance the recall-throughput tradeoff

Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré

Oral · NeurIPS

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Dan Fu, Simran Arora, Jessica Grogan, Isys Johnson, Evan Sabri Eyuboglu, Armin Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré

Oral · ICML

Hyena Hierarchy: Towards Larger Convolutional Language Models

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

Oral · ICLR

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Dan Fu, Tri Dao, Khaled Saab, Armin Thomas, Atri Rudra, Christopher Ré

Oral · ICML

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, Beidi Chen

Oral · ICML

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
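
Several of the papers above build on speculative decoding: a small draft model proposes a few tokens and the larger target model verifies them, so the target advances multiple tokens per verification. Here is a greedy-matching sketch under assumed interfaces; `draft` and `target` are hypothetical callables returning a greedy next token, and this generic loop is not any specific paper's method (the entries above study distribution-aware and RL-training variants beyond it):

```python
# Greedy speculative decoding (illustrative sketch only). A real system
# verifies all k proposals in ONE batched target forward pass; we call
# the target once per prefix here purely for readability.
def speculative_decode(target, draft, tokens, k=4, steps=32):
    tokens = list(tokens)
    for _ in range(steps):
        base = len(tokens)
        # 1. Cheap draft model proposes k tokens autoregressively.
        proposal = list(tokens)
        for _ in range(k):
            proposal.append(draft(proposal))
        # 2. Target model's greedy choice after each proposed prefix.
        target_preds = [target(proposal[: base + i]) for i in range(k + 1)]
        # 3. Keep the longest prefix where draft and target agree, then
        #    append the target's own token (correction or bonus token).
        accepted = 0
        while accepted < k and proposal[base + accepted] == target_preds[accepted]:
            accepted += 1
        tokens = proposal[: base + accepted] + [target_preds[accepted]]
    return tokens
```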

Key open-source projects

  • FlashAttention

    IO-aware exact attention, universally adopted

  • Flash Decoding

    8× faster long-context token generation

  • Mixture of Agents

    Open models working together beat GPT-4o (sketched after this list)

  • Dragonfly

    Tiny 8B model beats Med-Gemini on every benchmark

  • RedPajama Datasets

    100T+ tokens powering 500+ models

  • DeepCoder

    First open model to match o3-mini on code

  • Open Deep Research

    Open-source multi-model deep research agent

  • Open Data Scientist Agent

    Autonomous agent tops Adyen's real-world benchmark
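
The Mixture of Agents project above layers models: several proposers answer independently, later layers refine with each other's drafts in context, and a final aggregator synthesizes one response. A provider-agnostic sketch of that structure follows; `ask(model, prompt)` is a hypothetical chat-completion helper, not Together's API, and the prompts are deliberately simplified:

```python
# Mixture-of-Agents sketch (illustrative only; not the paper's exact prompts).
def mixture_of_agents(ask, question, proposers, aggregator, layers=2):
    # Layer 1: every proposer answers independently.
    drafts = [ask(m, question) for m in proposers]
    # Middle layers: proposers refine, seeing each other's drafts.
    for _ in range(layers - 1):
        context = "\n\n".join(f"Draft {i + 1}: {d}" for i, d in enumerate(drafts))
        prompt = f"{question}\n\nPrevious drafts:\n{context}\n\nImprove on them."
        drafts = [ask(m, prompt) for m in proposers]
    # Final layer: one aggregator synthesizes a single answer.
    context = "\n\n".join(f"Draft {i + 1}: {d}" for i, d in enumerate(drafts))
    return ask(aggregator, f"{question}\n\nCandidate answers:\n{context}\n\n"
                           "Synthesize the best single answer.")
```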

In the spotlight

Featured talks and conference presentations by our researchers

Conference

At Slush 2025, Together AI VP of Kernels Dan Fu dives into building, using, and managing AI agents.

Research team

Researchers and engineers pushing the boundaries of AI

Ce Zhang

Founder & CTO

Chris Ré

Founder

Tri Dao

Founder & Chief Scientist

Percy Liang

Founder

Ben Athiwaratkun

Core ML

Dan Fu

Kernels

James Zou

Frontier Agents

Leon Song

Core ML

Max Ryabinin

Model Shaping

Simran Arora

Kernels

Yineng Zhang

Inference