Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Introduction
We introduce Sequoia, a scalable, robust, and hardware-aware speculative decoding framework that improves LLM inference speed on consumer GPUs (with offloading) as well as on high-end GPUs (on-chip), without any approximations. We show below that, by building large trees of speculated tokens, Sequoia can serve Llama2-70B on a single RTX 4090 with an average time between tokens (TBT) as low as 0.57s, which is 8x faster than a highly optimized offloading serving system and 9x faster than DeepSpeed-Zero-Inference. In the on-chip setting, Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to 4.04x, 3.73x, and 2.27x, respectively.
Inference Speed with Sequoia
Offloading Results
On-chip Results
Sequoia can speed up LLM inference for a variety of model sizes and types of hardware. We evaluate Sequoia with LLMs of various sizes (Llama2-70B-chat, Vicuna-33B, Llama2-22B, Llama2-13B, and Llama2-7B) in both the offloading setting (on RTX 4090 and 2080 Ti GPUs) and the on-chip setting (on an A100 GPU). Prompts are taken from MT-Bench in the offloading setting and from the C4 validation set in the on-chip setting. The evaluation results are listed above.
Here we show a demo of Llama2-70B inference on a single RTX 4090, with and without Sequoia (the video plays at 4x speed).
Why Sequoia?
Sequoia significantly accelerates LLM serving in both the offloading and on-chip settings via core improvements to speculative decoding. First, Sequoia scales better with the number of speculated tokens: it uses a dynamic programming algorithm to search for the tree structure that maximizes the expected number of accepted tokens at each budget (i.e., the number of nodes in the speculation tree). Second, by sampling without replacement, Sequoia is more robust across decoding temperatures than top-k sampling or sampling with replacement. Finally, Sequoia provides a hardware-aware optimizer that selects the best tree size and depth for each hardware configuration (a sketch of this idea appears after the tree examples below). For further details, please see our paper.
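To make the first improvement concrete, below is a minimal sketch of the tree-construction problem, written under a positional acceptance assumption: the k-th ranked child of any accepted node is itself accepted with a fixed probability p_k. The acceptance probabilities, the budgets, and the brute-force search are illustrative assumptions, not Sequoia's actual code; the paper gives an efficient dynamic programming formulation, while this sketch simply enumerates ways to split a node budget among ranked children so as to maximize the expected number of accepted tokens.

from functools import lru_cache

# Assumed positional acceptance probabilities: P[k] is the probability that the
# (k+1)-th ranked child of a node is accepted, given that its parent was accepted.
# In practice these would be estimated empirically for a draft/target model pair.
P = [0.80, 0.45, 0.25, 0.12, 0.05]
MAX_CHILDREN = len(P)

@lru_cache(maxsize=None)
def best_tree_value(budget: int) -> float:
    """Expected number of accepted tokens for the best speculation tree
    with `budget` speculated nodes (the already-verified root is excluded)."""
    if budget == 0:
        return 0.0
    best = 0.0

    def split(rank: int, remaining: int, value: float) -> None:
        nonlocal best
        best = max(best, value)
        if rank == MAX_CHILDREN or remaining == 0:
            return
        for size in range(1, remaining + 1):
            # Give the child at this rank a subtree of `size` nodes in total:
            # with probability P[rank] the child is accepted, contributing one
            # token plus the best subtree built from its size - 1 descendants.
            gain = P[rank] * (1.0 + best_tree_value(size - 1))
            split(rank + 1, remaining - size, value + gain)

    split(0, budget, 0.0)
    return best

for n in (8, 16, 32):
    print(f"budget {n}: expected accepted tokens {best_tree_value(n):.2f}")

Sequoia's actual optimizer solves the same budget-allocation problem, but efficiently enough to handle the hundreds of nodes used in the offloading setting.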
Below we show two examples of tree structures in Sequoia. The left one has 64 nodes, which is suitable for on-chip inference, while the right one has 768 nodes, which suits the offloading setting. Our tree construction algorithm allocates more descendants to nodes in earlier layers that have a higher probability of acceptance.
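The hardware-aware optimizer can be viewed as a second loop around such a tree scorer: for each candidate tree size, it weighs the expected number of tokens produced per verification step against the time needed to draft and verify a tree of that size on the target hardware, and keeps the configuration with the highest tokens per second. The sketch below illustrates the idea; the function names, the toy acceptance numbers, and the latency profiles are assumptions for illustration, not Sequoia's interface or measurements.

def choose_tree_budget(budgets, expected_accepted, verify_latency, draft_latency):
    """Pick the speculation-tree size with the best expected tokens per second.

    budgets:           candidate tree sizes (numbers of speculated nodes)
    expected_accepted: maps a tree size to its expected number of accepted tokens
                       (e.g. precomputed by a tree scorer like the one above)
    verify_latency:    n -> seconds for the target model to verify n tokens
    draft_latency:     n -> seconds for the draft model to build an n-node tree
    """
    best_budget, best_rate = None, 0.0
    for n in budgets:
        # Each verification step yields the accepted speculated tokens plus one
        # bonus token sampled from the target model itself.
        tokens_per_step = 1.0 + expected_accepted[n]
        seconds_per_step = draft_latency(n) + verify_latency(n)
        rate = tokens_per_step / seconds_per_step
        if rate > best_rate:
            best_budget, best_rate = n, rate
    return best_budget, best_rate

# Toy inputs (illustrative only): expected accepted tokens saturate as the tree
# grows; on-chip verification stays cheap until the GPU is saturated, while
# offloaded verification pays a large fixed weight-transfer cost per step.
accepted = {16: 3.1, 32: 3.6, 64: 4.0, 128: 4.3, 256: 4.6, 512: 4.8, 768: 5.1}
on_chip = choose_tree_budget(
    [16, 32, 64, 128], accepted,
    verify_latency=lambda n: 0.020 + 0.0002 * max(0, n - 128),
    draft_latency=lambda n: 0.00005 * n,
)
offloading = choose_tree_budget(
    [128, 256, 512, 768], accepted,
    verify_latency=lambda n: 1.5 + 0.0001 * n,
    draft_latency=lambda n: 0.00005 * n,
)
print("on-chip:", on_chip)
print("offloading:", offloading)

With these toy profiles, the optimizer prefers a modest tree on-chip and a much larger one when offloading, mirroring the 64-node and 768-node examples above.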
Conclusion and Future Work
Leveraging Sequoia, anyone can use an RTX 4090 or another low-cost consumer GPU to host very strong LLMs such as 70B models without approximation, broadening the range of applications for AI-generated content. Sequoia also provides large speedups on high-end GPUs in the small-batch setting, improving the performance of latency-sensitive applications like chatbots.
We believe Sequoia will perform particularly well on future hardware, because its performance scales with the compute-to-bandwidth ratio of the hardware, which has been increasing over time (e.g., from V100 to A100 to H100). Sequoia helps mitigate the bandwidth gaps across the memory hierarchy (SRAM, HBM, RAM, SSD, ...) with smart algorithms, opening new opportunities for AI accelerator design. We are excited to design even faster algorithms for future hardware!
BibTeX
@article{chen2024sequoia,
title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
journal={arXiv preprint arXiv:2402.12374},
year={2024}
}