Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Introduction
We introduce Sequoia, a scalable, robust, and hardware-aware speculative decoding framework that improves LLM inference speed on consumer GPUs (with offloading) as well as on high-end GPUs (on-chip), without any approximations. We show below that, by building large trees of speculated tokens, Sequoia can serve Llama2-70B on a single RTX 4090 with an average time between tokens (TBT) as low as 0.57s, which is 8x faster than a highly optimized offloading serving system and 9x faster than DeepSpeed-Zero-Inference. In the on-chip setting, Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to 4.04x, 3.73x, and 2.27x, respectively.
Inference Speed with Sequoia
Offloading Results
On-chip Results
Sequoia can speed up LLM inference for a variety of model sizes and types of hardware. We evaluate Sequoia with LLMs of various sizes (Llama2-70B-chat, Vicuna-33B, Llama2-22B, Llama2-13B, and Llama2-7B) in both the offloading setting (on RTX 4090 and 2080 Ti GPUs) and the on-chip setting (on an A100 GPU). Prompts are taken from MT-Bench in the offloading setting and from the C4 validation set in the on-chip setting. The evaluation results are listed above.
Here we show a demo of Llama2-70B inference on a single RTX 4090, with and without Sequoia (the video plays at 4x speed).
Why Sequoia?
Sequoia significantly accelerates LLM serving in both the offloading and on-chip settings via core improvements to speculative decoding. First, Sequoia scales better with the number of speculated tokens: it uses a dynamic programming algorithm to search for the tree structure that maximizes the expected number of accepted tokens at each budget (i.e., the number of nodes in the speculation tree). Second, by sampling without replacement, Sequoia is more robust across decoding temperatures than top-k sampling or sampling with replacement. Finally, Sequoia provides a hardware-aware optimizer that selects the best tree size and depth for each hardware configuration (a sketch of this idea appears after the tree examples below). For further details, please see our paper.
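To make the first improvement concrete, below is a minimal sketch of the tree-construction problem, written under a positional acceptance assumption: the k-th ranked child of any accepted node is itself accepted with a fixed probability p_k. The acceptance probabilities, the budgets, and the brute-force search are illustrative assumptions, not Sequoia's actual code; the paper gives an efficient dynamic programming formulation, while this sketch simply enumerates ways to split a node budget among ranked children so as to maximize the expected number of accepted tokens.

from functools import lru_cache

# Assumed positional acceptance probabilities: P[k] is the probability that the
# (k+1)-th ranked child of a node is accepted, given that its parent was accepted.
# In practice these would be estimated empirically for a draft/target model pair.
P = [0.80, 0.45, 0.25, 0.12, 0.05]
MAX_CHILDREN = len(P)

@lru_cache(maxsize=None)
def best_tree_value(budget: int) -> float:
    """Expected number of accepted tokens for the best speculation tree
    with `budget` speculated nodes (the already-verified root is excluded)."""
    if budget == 0:
        return 0.0
    best = 0.0

    def split(rank: int, remaining: int, value: float) -> None:
        nonlocal best
        best = max(best, value)
        if rank == MAX_CHILDREN or remaining == 0:
            return
        for size in range(1, remaining + 1):
            # Give the child at this rank a subtree of `size` nodes in total:
            # with probability P[rank] the child is accepted, contributing one
            # token plus the best subtree built from its size - 1 descendants.
            gain = P[rank] * (1.0 + best_tree_value(size - 1))
            split(rank + 1, remaining - size, value + gain)

    split(0, budget, 0.0)
    return best

for n in (8, 16, 32):
    print(f"budget {n}: expected accepted tokens {best_tree_value(n):.2f}")

Sequoia's actual optimizer solves the same budget-allocation problem, but efficiently enough to handle the hundreds of nodes used in the offloading setting.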
Below we show two examples of tree structures in Sequoia. The left one has 64 nodes, which is suitable for on-chip inference, while the right one has 768 nodes, which suits the offloading setting. Our tree construction algorithm allocates more descendants to nodes in earlier layers that have a higher probability of acceptance.
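The hardware-aware optimizer can be viewed as a second loop around such a tree scorer: for each candidate tree size, it weighs the expected number of tokens produced per verification step against the time needed to draft and verify a tree of that size on the target hardware, and keeps the configuration with the highest tokens per second. The sketch below illustrates the idea; the function names, the toy acceptance numbers, and the latency profiles are assumptions for illustration, not Sequoia's interface or measurements.

def choose_tree_budget(budgets, expected_accepted, verify_latency, draft_latency):
    """Pick the speculation-tree size with the best expected tokens per second.

    budgets:           candidate tree sizes (numbers of speculated nodes)
    expected_accepted: maps a tree size to its expected number of accepted tokens
                       (e.g. precomputed by a tree scorer like the one above)
    verify_latency:    n -> seconds for the target model to verify n tokens
    draft_latency:     n -> seconds for the draft model to build an n-node tree
    """
    best_budget, best_rate = None, 0.0
    for n in budgets:
        # Each verification step yields the accepted speculated tokens plus one
        # bonus token sampled from the target model itself.
        tokens_per_step = 1.0 + expected_accepted[n]
        seconds_per_step = draft_latency(n) + verify_latency(n)
        rate = tokens_per_step / seconds_per_step
        if rate > best_rate:
            best_budget, best_rate = n, rate
    return best_budget, best_rate

# Toy inputs (illustrative only): expected accepted tokens saturate as the tree
# grows; on-chip verification stays cheap until the GPU is saturated, while
# offloaded verification pays a large fixed weight-transfer cost per step.
accepted = {16: 3.1, 32: 3.6, 64: 4.0, 128: 4.3, 256: 4.6, 512: 4.8, 768: 5.1}
on_chip = choose_tree_budget(
    [16, 32, 64, 128], accepted,
    verify_latency=lambda n: 0.020 + 0.0002 * max(0, n - 128),
    draft_latency=lambda n: 0.00005 * n,
)
offloading = choose_tree_budget(
    [128, 256, 512, 768], accepted,
    verify_latency=lambda n: 1.5 + 0.0001 * n,
    draft_latency=lambda n: 0.00005 * n,
)
print("on-chip:", on_chip)
print("offloading:", offloading)

With these toy profiles, the optimizer prefers a modest tree on-chip and a much larger one when offloading, mirroring the 64-node and 768-node examples above.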
Conclusion and Future Work
Leveraging Sequoia, anyone can use an RTX 4090 or another low-cost consumer GPU to host very strong LLMs such as 70B models without approximation, broadening the range of applications for AI-generated content. Sequoia also provides large speedups on high-end GPUs in the small-batch setting, improving the performance of latency-sensitive applications like chatbots.
We believe Sequoia will perform particularly well on future hardware, because its performance scales with the compute-to-bandwidth ratio of the hardware, which has been increasing over time (e.g., from V100 to A100 to H100). Sequoia helps mitigate the bandwidth gaps across the memory hierarchy (SRAM, HBM, RAM, SSD, ...) with smart algorithms, opening new opportunities for AI accelerator design. We are excited to design even faster algorithms for future hardware!
BibTeX
@article{chen2024sequoia,
title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
journal={arXiv preprint arXiv:2402.12374},
year={2024}
}