
ThunderKittens: A Simple Embedded DSL for AI kernels

May 12, 2024

By 

Benjamin Spector, Aaryan Singhal, Simran Arora, Chris Ré

We have a lot of fun building kernels for AI, and we’re proud of our open-source AI contributions. We’re less proud that people regard FlashAttention as alien technology. In a NeurIPS keynote, our group tried our best to make the key ideas simple, but the gap from the beautiful pictures generated by Dan Fu to actual “gpu go brr” CUDA is still too damn high. So we set out to write a simple framework that makes it really easy to express the key technical ideas. Relatively quickly, we had a small library (DSL?) that we called ThunderKittens, which we hope lets us write clean, simple-to-understand code that indeed makes GPUs go brrr.

Why ThunderKittens? We wanted something that was adorable and cute, and there are too many macho head games in the current names. Let’s be fun and welcoming!

Our observations for TK are pretty simple:

  • You want to keep tensor cores busy. These things are 94% of the compute on an H100, so if you don’t use them you’re capped at 6% utilization. In TK, the fundamental object is therefore a small matrix tile, sized to fit the tensor cores (see the sketch after this list).
  • We want something familiar to AI people. So we want an API that is PyTorch-like, because those folks are amazing and we love them!
  • We wanted to get the full power of the host language (CUDA or HIP), and we don’t want to hide how accelerators work. We think accelerators are awesome and more AI people should play around at this intersection. Pragmatically, AI processors are changing a lot – they have to! – but this means that to take advantage of the latest and greatest you need to break out of software abstractions. So, TK is an embedded DSL somewhere between NVIDIA’s CuTe (in being embedded in raw CUDA) and Triton (in being focused on AI kernels). If you know CUDA, you can probably “compile” TK in your head; we try to make the structures transparent.
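
To make the tile idea concrete, here is a minimal sketch in raw CUDA using the standard WMMA intrinsics. To be clear, this is not TK’s own API; it only illustrates the concept TK wraps: one warp owns a 16x16 tile held in registers, and the tensor cores consume whole tiles per instruction. The kernel name and shapes here are ours, purely for illustration.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Illustrative only (not TK): one warp (32 threads) computes one 16x16
// tile of C = A * B. A is row-major 16 x K, B is row-major K x 16.
__global__ void tile_matmul(const half* A, const half* B, float* C, int K) {
    // A "fragment" is the tile: it lives in registers, spread across the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March down K one 16-wide tile at a time; each mma_sync runs on the
    // tensor cores and consumes whole tiles.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + k, K);        // tile (0, k/16) of A
        wmma::load_matrix_sync(b_frag, B + k * 16, 16);  // tile (k/16, 0) of B
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launched with a single warp for this one tile, e.g.:
//   tile_matmul<<<1, 32>>>(dA, dB, dC, K);
```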

You may have seen oblique references to TK in some of our earlier papers, e.g., Based. We’ve been using it for a while… so why are we releasing it now? Honestly, we don’t know. But here are some things we think are interesting:

  • On 4090s and A100s, TK matches FlashAttention-2 (FA2) performance, in just a few lines of code.
  • On H100s, TK is faster than FA2 by quite a bit, in both the forward and backward passes, so there is no tradeoff between clean code and speed.
  • The lab folks wrote a bunch of other kernels in TK: things like Based, Hedgehog, and more! For some of them, we couldn’t figure out how to get similar performance in Triton.
  • AI folks who attended a two-hour CUDA session have been able to write working TK code, which is one step toward making this easier to use.

We do want you to understand that this is an art project; if you like it, great! If you don’t, that’s ok too… feedback is great! But do not expect that we’ll do anything about your complaints in the code. For now, TK is for our fun and useful for us, and we hope it makes key ideas clear. To help with that, we’re posting a version integrated with Andrej’s awesome NanoGPT project, NanoGPT-TK. Andrej is one of the best communicators in AI and we’ve been using NanoGPT in teaching, so we’re delighted to build on top of it. For more details, check out this longer post on Hazy Research.
