Research

Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

April 21, 2025

・

Austin Silveria, Soham Govande, Dan Fu

TL;DR: We present Chipmunk, a training-free method to accelerate diffusion transformers with hardware-aware dynamic sparsity. Chipmunk caches attention weights and MLP activations from previous steps and dynamically computes a sparse “delta” against the cached weights. Chipmunk achieves up to 3.7x faster video generation on HunyuanVideo at 720x1280 resolution for a 5s video, and 1.6x faster image generations on FLUX.1-dev at 1280x768 resolution.

This blog is cross-posted to the Sandy Research blog at UCSD. Check out Part II and Part III on the Sandy Research blog for a deeper dive into the sparsity patterns and the kernels behind Chipmunk!

*Images of cute chipmunks can be generated 1.37x faster! Left: Fully Dense FLUX.1-dev. Right: Ours (84% sparse attention and 70% sparse MLP)*

Motivation: Diffusion Transformers (DiTs) have become the standard for video generation, but the time and cost of generation keeps them out of reach of many applications. We raise two questions: (1) What do the model activations want to do? (2) What does the hardware want to do? We then use these insights to design hardware-friendly algorithms that maximize quality per unit of generation time.

In this post, we unpack:

Slow-Changing, Sparse Activations: DiT activations for MLP and attention change slowly across steps, and they are naturally sparse.
Cross-Step Deltas: Because of the slow changing activations and natural sparsity, reformulating them to compute cross-step deltas make them even sparser.
Hardware-Aware Sparsity Pattern: For both attention and MLP, we can pack dense shared memory tiles from non-contiguous columns in global memory. We open-source fast kernels for this!

But first, a preview of our results:

Hunyuan	Latency	Speedup	VBench Quality	VB Semantic	VB Total	Resolution	Sparsity
FlashAttention-3	1030s	1x	85.09%	75.82%	83.24%	720x1280x129	0%
Sliding Tile Attention (Training-Free)	945s -> 527s	1.79x	84.63%	73.83%	82.46%	768x1280x117	58%
Chipmunk (Training-Free)	1030s -> 477s	2.16x	84.60%	76.29%	82.94%	720x1280x129	82% *
Chipmunk + Step Caching (Training-Free)	1030s -> 277s	3.72x	84.22%	75.60%	82.50%	720x1280x129	87%

* 93% sparsity on 44 out of 50 steps for an average of 82% sparsity.

FLUX.1-dev* (bf16)	ImageReward	MLP Sparsity	Attn Sparsity	Speedup
Baseline (with FlashAttention-3)	76.6%	0%	0%	1x
Chipmunk	80.2%	70%	83.5%	1.37x
Chipmunk + Step Caching	78.0%	70%	83.5%	1.63x

These FLUX.1-dev numbers were evaluated on 1280x768 images, and we’ve found that if we increase image size to 2304x1280, we can get speedups of up to 1.65x per-image without stacking on top of step caching methods, and 1.9x with step caching! We’ve also found that we can sparsify FP8 Flux to get a 1.1x end-to-end speedup over the fastest open-source implementation.

Slow-Changing, Sparse Activations

Chipmunk exploits two simple observations about diffusion transformers:

Activations move slowly: In each step a Diffusion Transformer (DiT) denoises a latent noise vector. This noise vector changes slowly across successive steps in the diffusion process – and so do the per-layer activations.
Activations are sparse: In attention, it is common to see queries place a very large percentage of their attention probability mass on a small subset of keys–this means that the output will mostly be made up of the small subset of associated rows of $V$. And in MLP, previous works have observed significant sparsity in the intermediate activations of both ReLU and GeLU-based layers, meaning that the output will mostly be made up of the top activated rows of $W_2$.

Activation Deltas Across Diffusion Steps are Very Sparse

Chipmunk uses these two observations to reduce the compute costs of the diffusion model – we can effectively capture nearly all the cross-step changes in the activations by only recomputing a small subset of attention and MLP.

What does this mean, concretely? Let’s revisit the attention and MLP equations:

Attention: $\text{softmax}(Q @ K^T) @ V)$
‍MLP: $\text{gelu}(x @ W_1) @ W_2)$

Both operations use a non-linearity to compute the scalar coefficients for a linear combination of value vectors. In attention, the value vectors are dynamic ($V$ is projected from the current token representation). In MLP, the value vectors are static (rows of the weights $W_2$). Thus, in attention, our outputs are a sum of scaled rows in the V matrix, and in MLP, our outputs are a sum of scaled rows in the $W_2$ matrix (the bias is one extra static vector). We can visualize these individual vectors as being summed to produce the total operation output.

Chipmunk’s key insight is that the value vectors (the colored columns of v above) change slowly, as do the scalar weights themselves (the colored values in the attention matrix above). Chipmunk caches the value vectors and the scalar weights, and dynamically chooses which ones to recompute in each step:

Given an attention/MLP output cache, an equivalent definition of a normal dense forward pass on step $n$ is the following: Subtract all of step $n-1$’s output vectors from the cache, and add all of step $n$’s new vectors. Therefore, given the natural sparsity in intermediate matrices, we can reformulate attention and MLP to compute a delta based on the previous step’s outputs. That is, we replace a subset of the output vectors and reuse the rest from the previous step. The output vectors that we replace correspond to sparsifying keys/values at the granularity of a single token in the intermediate matrix.

Hardware-Efficient Sparsity Pattern

The sparsity pattern we’ve been describing thus far, recomputing individual scaled output vectors for each token, corresponds to [1, 1] unstructured sparsity on the intermediate activations. GPUs do not like this. What they do like is computing large blocks at once, in the size ballpark of [128, 256] (in the current generation). This corresponds to 128 contiguous tokens and 256 contiguous keys/values.

Computing with block sparsity that aligns with the native tile sizes of the kernel is essentially free because the GPU is using the same large matrix multiplication sizes and skips full blocks of work.

However, there is one optimization we can make to efficiently get to [128, 1] column sparsity. Looking at our matrix multiplication diagram, let’s think through what happens if we reorder the columns of $k^t$ and $v^t$. A reordering of $k^t$ will apply the same reordering to the columns of $A = q @ k^t$. And if we apply the same reordering to $v^t$, then the end result $o$ is actually the same because the columns of $A$ still align with the correct columns of $v^t$.

What this allows us to do is compute attention or MLP with any ordering of the keys/values in shared memory–thus we can pack our sparse keys/values from non-contiguous rows in global memory into a dense tile in shared memory.

The more granular loads incur a small performance penalty, but we find that the sparsity levels make up for this–e.g. at 93% sparsity, our column-sparse attention kernel in ThunderKittens is ~10x times faster than the dense baseline.

Ok, so now we’re working with [128, 1] column sparsity, which corresponds to 128 contiguous tokens recomputing the same set of individual output vectors across steps. Intuitively, we expect that small 2D patches of an image have similar color and brightness. And in video, we expect the same for small 3D cubes (voxels). Yet, the natural token order is raster order from left to right, top down, and frame zero onwards. To create 128-size chunks with the most similar tokens, we reorder the tokens (and RoPe embeddings) once at the beginning of the diffusion process such that a chunk in the flattened sequence corresponds to a patch/voxel. These similar tokens, which we expect to interact with similar keys/values, now share the same set of sparse indices because they occupy contiguous rows of the input matrix. At the end of the diffusion process, we then reverse this reordering before decoding to pixel space.

Kernel Optimizations

Our kernel optimizations achieve efficient dynamic sparsity and caching through:

Fast sparsity identification: We fuse custom kernels to quickly compute sparse indices by reusing softmax constants and implementing a fast approximate top-k CUDA kernel with shared memory atomics, which is ≥2x faster than PyTorch’s native implementations
Fast cache writeback: We use the CUDA driver API to overlap the cache writeback with subsequent GEMM computations by allocating leftover streaming multiprocessors (SMs) to custom TMA-based reduction kernels (with PTX instructions like cp.reduce.async.bulk) during the tail effects of wave quantization, achieving a 2x speedup over naive implementations and saving ~20 microseconds per MLP invocation.
Warp-Specialized Persistent Kernel: We let the producer warpgroup’s memory loads overlap with consumer epilogues (which are expensive because of all the caching computation), and store swizzle offsets in registers, minimizing address computation overhead when using granular cp.async loads instead of TMA.

Come and play with Chipmunks!

The only thing we love more than chipmunks is the open-source community! Check out our GitHub repository and make your image and video models go brrrr. This post was just a sneak peek—we’re also releasing in-depth technical blogs on a deep dive through the math and kernel optimizations.

*We’re big fans of ThunderKittens, and so are our chipmunks! Our sparse attention and MLP kernels let our chipmunks play nicely with their kitten friends.*

At Together AI, we’re constantly pushing the state of the art in model acceleration to serve the fastest models at the lowest cost: FLUX-1.dev, DeepSeek R1, Llama 4. We’re excited to continue our research to extend granular sparsity across more model architectures and integrate with training algorithms for even more acceleration.

Lower
Cost
20%
faster
training
4x
network
compression
117x

Q: Should I use the RedPajama-V2 Dataset out of the box?

RedPajama-V2 is conceptualized as a pool of data that serves as a foundation for creating high quality datasets. The dataset is thus not intended to be used out of the box and, depending on the application, data should be filtered out using the quality signals that accompany the data. With this dataset, we take the view that the optimal filtering of data is dependent on the intended use. Our goal is to provide all the signals and tooling that enables this.

Links in this
article