Research

HELM: benchmarking large language models on the Together Research Computer

November 17, 2022

By Together

Together’s software network harnessed spare GPU cycles across thousands of servers to benchmark 10 prominent open language models and process 11 billion tokens.

We have entered the era of foundation models — massive models trained on huge amounts of data — which can be adapted to a wide range of applications. Language models such as GPT-3 in particular have rich capabilities. They can improve the quality of existing applications (e.g., question answering) and introduce novel applications (e.g., brainstorming slogans or writing blog posts).

The pace of innovation is rapid, with new models being announced regularly. But given the vast surface of capabilities of language models, it’s not clear when they work and when they don’t, which is critical for downstream users. The Stanford Center for Research on Foundation Models (CRFM) has recently announced Holistic Evaluation of Language Models (HELM), a comprehensive effort to benchmark 30 language models. 10 of these models are open, with their model weights publicly released.

The HELM language modeling benchmark features open models from a number of organizations.

A core challenge in building HELM is computation. HELM covers 42 scenarios, from question answering to sentiment analysis. Running HELM on the 10 open models requires inference over more than 11 billion input tokens and 1.6 billion output tokens. These experiments consumed over 60,000 GPU hours across all tests and tuning, with the final full benchmark run requiring 20,000 GPU hours. As new open models are released (e.g., Flan-T5, BLOOMZ, Galactica) and the number of scenarios grows, maintaining and expanding the benchmark will create a growing demand for computation.
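To put these figures in perspective, here is a rough back-of-envelope estimate of the implied per-GPU throughput, assuming (as a simplification) that the final 20,000-GPU-hour run processed the full token count:

```python
# Back-of-envelope throughput estimate from the figures above.
# Assumes the final 20,000-GPU-hour run processed all 11B input
# and 1.6B output tokens (an illustrative simplification).
input_tokens = 11e9
output_tokens = 1.6e9
gpu_hours = 20_000

gpu_seconds = gpu_hours * 3600
tokens_per_gpu_second = (input_tokens + output_tokens) / gpu_seconds
print(f"{tokens_per_gpu_second:.0f} tokens processed per GPU-second")
```

On these assumptions, each GPU handles on the order of 175 tokens per second, which is why aggregate throughput, rather than single-request latency, is the quantity worth optimizing.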

Available GPU hours per day over a month.

Instead of relying on dedicated cloud services, HELM is powered by a different source of computing: “idle” GPU cycles in academic clusters, contributed by the research community. From our experience with the Open Science Grid and three academic clusters (Stanford, ETH Zurich, and the University of Wisconsin-Madison), there are more than 2,000 GPU cards in total, and it is not uncommon to see hundreds of them idle for several hours a day (as long as it’s not right before a major conference deadline!). How can we aggregate all of these resources, not only to support HELM, but to enable researchers to tackle more ambitious projects?

Together is building systems to enable this kind of efficient, shared compute for AI. Together’s mission is to bring the world’s compute together to enable everyone to contribute to and benefit from advanced AI. We have roots in academic research and are committed to the growth in open research and advancing the state of open AI models.

As an early step towards this mission, we started by aggregating academic compute into the Together Research Computer. The Together Research Computer today combines the resources from five partners (Stanford, ETH Zurich, Open Science Grid, University of Wisconsin-Madison, and CrusoeCloud) with a total of more than 200 PetaFLOPS worth of compute. All 10 open models in HELM are run on the Together Research Computer.

Decentralized computing can solve the computation bottlenecks in LLMs, and make the field more accessible to researchers and practitioners.
The Together Research Computer today connects Stanford, ETH Zurich, Open Science Grid, University of Wisconsin-Madison, and CrusoeCloud. We hope this network will continue to grow; please talk to us!

There is a range of technical challenges in aggregating decentralized compute for running large models. To support HELM’s benchmarking, we specifically optimized for the high-throughput setting, where we seek to maximize throughput (# tokens/second) rather than minimize the latency of any single inference task. The HTCondor project and the Center for High Throughput Computing at UW-Madison, which pioneered this concept, were a source of inspiration here.
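The throughput-versus-latency trade-off can be illustrated with a toy cost model (the constants below are invented for illustration and do not describe our scheduler): with a fixed per-batch overhead, larger batches raise aggregate tokens per second even as each individual request waits longer.

```python
# Illustrative cost model for throughput-oriented batching.
# All constants are hypothetical, not measured numbers.
OVERHEAD_S = 0.5        # fixed cost per batch (scheduling, weight loading)
PER_SEQ_S = 0.05        # incremental cost per sequence in a batch
TOKENS_PER_SEQ = 512    # tokens produced per sequence

def throughput(batch_size: int) -> float:
    """Tokens per second achieved by one batch of `batch_size` sequences."""
    latency = OVERHEAD_S + PER_SEQ_S * batch_size
    return batch_size * TOKENS_PER_SEQ / latency

for b in (1, 8, 64):
    print(f"batch={b:3d}  throughput={throughput(b):8.0f} tok/s")
```

Under this model, throughput grows monotonically with batch size while per-request latency also grows, which is exactly the regime where harvested, bursty compute is a good fit.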

First, to support the largest models (a 175-billion-parameter model requires 350GB of storage), we have to partition the model onto different devices, already a challenging task even with dedicated machines in the cloud. In our decentralized setting, this becomes harder still: machines and networks can be vastly heterogeneous, and to best utilize all available devices we need to carefully partition a large model and map different parts of the computation to different devices. In [https://arxiv.org/abs/2206.01288], we describe our research on resource allocation for training over heterogeneous networks, and we follow a similar approach to handle inference on heterogeneous devices.
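A minimal sketch of the partitioning idea (a greedy heuristic for illustration only; the allocator described in the paper solves a richer optimization problem over both memory and network topology): pack contiguous model layers into pipeline stages so that each stage fits the memory of the device assigned to it.

```python
# Greedy pipeline-partitioning sketch: assign contiguous layers to
# devices in order, opening a new stage whenever the current device's
# memory would overflow. All sizes below are hypothetical.

def partition_layers(layer_gb, device_caps_gb):
    """Return a list of (start, end) layer-index ranges, one per device used."""
    stages, start, used, dev = [], 0, 0.0, 0
    for i, size in enumerate(layer_gb):
        if used + size > device_caps_gb[dev]:
            stages.append((start, i))            # close the current stage
            start, used, dev = i, 0.0, dev + 1
            if dev >= len(device_caps_gb):
                raise ValueError("model does not fit on the given devices")
        used += size
    stages.append((start, len(layer_gb)))
    return stages

# e.g. 12 layers of 2GB each over devices with 10GB, 8GB, and 16GB:
print(partition_layers([2.0] * 12, [10, 8, 16]))  # → [(0, 5), (5, 9), (9, 12)]
```

Heterogeneity shows up directly in the output: the 8GB device gets fewer layers than its neighbors, and a real allocator must additionally weigh inter-stage network bandwidth when choosing the cut points.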

Second, we need to gracefully handle preemption: after all, we harvest GPU cycles only when no one else is using a machine, and our jobs are evicted the moment the machine is required by other users. This requires particular care when a model is partitioned across devices, since one part of the model may be evicted while another part is still running. We periodically checkpoint and synchronize inference results to a global file system, and designed a communication-efficient way of detecting evictions.
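The checkpoint-and-resume pattern can be sketched as follows (the file name, granularity, and toy "model" are invented for illustration; the real system synchronizes results to a shared global file system):

```python
import json
import os

CHECKPOINT = "results.json"  # hypothetical path on a shared file system

def load_done():
    """Load results that survived a previous (possibly preempted) run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}

def run_inference(prompt):
    return prompt.upper()  # stand-in for a real model call

def process(prompts, checkpoint_every=2):
    done = load_done()
    for i, p in enumerate(prompts):
        if p in done:
            continue                     # skip work finished before an eviction
        done[p] = run_inference(p)
        if (i + 1) % checkpoint_every == 0:
            with open(CHECKPOINT, "w") as f:
                json.dump(done, f)       # periodic checkpoint
    with open(CHECKPOINT, "w") as f:
        json.dump(done, f)               # final flush
    return done

print(process(["a", "b", "c"]))
```

If the process is killed mid-run, a restart resumes from the last checkpoint rather than from scratch, which is what makes harvested, preemptible GPU time usable for long benchmark sweeps.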

We are excited by what the Together Research Computer, an early application of our distributed computing technology, has enabled so far and plan to continue supporting HELM as well as other benchmark efforts such as DataPerf [https://dataperf.org/] in collaboration with MLCommons [https://mlcommons.org/en/]. Benchmarking is just the first step; we also want to support fine-tuning and pre-training of foundation models.

Together is committed to building a thriving ecosystem for open foundation models and mechanisms to aggregate compute at a very large scale. We believe better open models will create widespread value and create equitable access to this important technology. To get there, we want to involve the AI communities in research and industry. We are looking for partners who have idle compute to share, researchers who are interested in decentralized training or just trying out ideas on the Together Research Computer. Please talk to us if you are also excited by this vision. Let’s build the best open models in the world, together!
