
Minions: embracing small LMs, shifting compute on-device, and cutting cloud costs in the process

February 25, 2025

By Avanika Narayan*, Dan Biderman*, Sabri Eyuboglu*, Avner May, Scott Linderman, James Zou, Christopher Ré

Summary

We present Minions, a method that shifts a substantial portion of LLM workloads to consumer devices by having small on-device models collaborate with frontier models in the cloud. Because long contexts are read only on-device, we reduce cloud costs with minimal or no quality degradation. We imagine a future where an “intelligence layer” running persistently on-device interacts with frontier models in the cloud to deliver applications with cost-effective, “always-on” intelligence.

The Tailwinds: Frontier Models Cost $$$, While Local Hardware is Collecting Dust

Compute is everywhere—our laptops, phones, cars, and even eyeglasses (shoutout to Chris, and no, not our advisor ❤️)—and the amount of compute on these consumer devices seems to be growing rapidly (e.g., Nvidia just announced a desktop with Blackwell GPUs). Thanks to incredible packages like Ollama 😊 (along with many others), consumer devices can now run surprisingly capable small LMs. While we’re calling them “small”, remember that we’re talking about models that have billions of parameters and can tackle real tasks. And they are improving rapidly.

With models of that caliber on-device, we can envision a world where laptops host always-on coding assistants that tirelessly refactor our code, local agents that process documents seamlessly, and smart security systems that detect anomalies before they escalate. We’re getting there—but not quite yet. Today’s local LMs handle more humble tasks, like auto-complete, text summarization, and small-scale RAG pipelines. Any moderately challenging task is handled by frontier models in the cloud. However, these models are not cheap. Consider our code-refactoring dream: asking OpenAI’s o1 a single question about a respectable code repository (with one million tokens) will cost us >$15¹!!!

¹ o1’s current rate is $15.00 per 1M input tokens.

The Big Question: Can Local LMs Step Up?

Small LMs are improving rapidly, perhaps even more rapidly than their frontier counterparts. But it’s undeniable that for some higher-order or domain-specific reasoning tasks (e.g., medical reasoning), frontier models have a clear advantage. Instead of seeking to replace big models with small ones, we ask: to what extent can these two model tiers collaborate to reduce cloud API costs?

To mimic the dream use cases mentioned above – of continuous grunt work on-device – we analyze three data-intensive tasks that involve retrieving and integrating information from long, domain-specific documents:

  • Financial analysis over 100-page 10-K reports (FinanceBench)
  • Medical reasoning over patient records (LongHealth)
  • Question answering over scientific papers (QASPER)

The Naive Attempt: Minion

We first tried a simple approach: letting the on-device LM, which has access to the data, chat freely (see Figure 1 below) with a cloud model—like a junior analyst consulting a senior expert. The models choose which information to share and when to terminate the conversation. Since the long document is never sent to the cloud, we pay only 3.3% of the cloud cost while maintaining 87% of the cloud model’s performance. Not bad at all! But why do we lose 13 points? Performance seems to take a hit due to two limitations of the current generation of small LMs:

  • Long Context Struggles – As document length increases, small LMs perform worse.
  • Multi-Step Confusion – As instructions become more complex and multipart, small LMs struggle to follow them reliably.

Beyond these challenges, hardware utilization (on a consumer GPU like the RTX 4090) remained low, leaving both compute and potential accuracy gains on the table (h/t Large Language 🐒’s). A back-and-forth chat does not take advantage of batching on-device, which is crucial for achieving high utilization (see the sketch below).

Figure 1: Sample Minion Chat 
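
To make the exchange concrete, here is a minimal sketch of the Minion chat loop. The `local_lm` and `remote_lm` objects and their `.chat()` interface are hypothetical stand-ins for this post, not the actual API from our repository:

```python
# Minimal sketch of the naive Minion protocol. local_lm / remote_lm and
# their .chat() interface are hypothetical stand-ins, not our real API.

def minion_chat(task: str, context: str, local_lm, remote_lm, max_turns: int = 8) -> str:
    """The local LM holds the long context; the remote LM never sees it."""
    # The remote LM opens the conversation by asking for what it needs.
    remote_msg = remote_lm.chat(
        f"Task: {task}. A local model holds the relevant documents. "
        f"Ask it questions, and reply 'FINAL: <answer>' when you are done."
    )
    for _ in range(max_turns):
        if remote_msg.startswith("FINAL:"):
            return remote_msg.removeprefix("FINAL:").strip()
        # The local LM answers from the full context, which stays on-device.
        local_msg = local_lm.chat(f"Context:\n{context}\n\nQuestion: {remote_msg}")
        # Only this short reply crosses the network -- hence the ~3.3% cost.
        remote_msg = remote_lm.chat(f"Local model's reply: {local_msg}")
    return remote_msg
```

Note that each turn processes a single prompt on-device, one at a time—exactly why the chat pattern leaves batching, and thus GPU utilization, on the table.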

The Sophisticated Attempt: Minions

The lessons we learned led us to the enhanced MinionS protocol (emphasis on the plural “s”). MinionS introduces a decomposition-execution-aggregation loop that addresses the weaknesses of small LMs while making them work harder (in MinionS, we work harder *and* smarter).

Figure 2: Description of MinionS Protocol

Here is how it works:

  • Decompose – The remote LM decomposes the task into smaller, single-step subtasks to be executed on chunks of the context. But how can it decompose the task and chunk the context without seeing it? To this end, we have the remote LM generate code for task decomposition and chunking, which is executed on-device where the context is available. 
  • Execute – The local LM runs the subtasks in parallel, keeping only the most relevant outputs and communicating them to the remote LM.
  • Aggregate – The remote LM combines the local outputs, either finalizing the answer or looping back to request another round of subtasks (see the sketch below).
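
Putting the three stages together, here is a minimal sketch of the loop. The `remote_lm`, `local_lm`, and `run_sandboxed` names are illustrative stand-ins for this post, not our actual API:

```python
# Minimal sketch of the MinionS decompose-execute-aggregate loop.
# remote_lm, local_lm, and run_sandboxed are hypothetical helpers.

from concurrent.futures import ThreadPoolExecutor

def minions(task: str, context: str, local_lm, remote_lm, max_rounds: int = 3) -> str:
    findings: list[str] = []
    for _ in range(max_rounds):
        # 1. DECOMPOSE: the remote LM writes code that chunks the context and
        #    produces (chunk, subtask) pairs, without ever seeing the context.
        decompose_code = remote_lm.complete(
            f"Task: {task}. Write a Python function make_jobs(context) that "
            f"chunks the document and returns a list of (chunk, subtask) pairs, "
            f"where each subtask is a simple single-step instruction."
        )
        jobs = run_sandboxed(decompose_code, context)  # executed on-device

        # 2. EXECUTE: the local LM runs the subtasks in parallel (batched
        #    execution keeps on-device GPU utilization high), then filters
        #    out irrelevant outputs before anything leaves the device.
        with ThreadPoolExecutor() as pool:
            outputs = list(pool.map(
                lambda job: local_lm.complete(
                    f"{job[0]}\n\nSubtask: {job[1]}\nAnswer, or say NONE if irrelevant."
                ),
                jobs,
            ))
        findings += [o for o in outputs if "NONE" not in o]

        # 3. AGGREGATE: the remote LM reads only the short, filtered findings
        #    and either finalizes the answer or requests another round.
        verdict = remote_lm.complete(
            f"Task: {task}. Findings so far: {findings}. "
            f"Reply 'FINAL: <answer>' if you can answer, or 'MORE' to continue."
        )
        if verdict.startswith("FINAL:"):
            return verdict.removeprefix("FINAL:").strip()
    return verdict
```

The key design choice: code, not the context, flows down from the cloud; short subtask outputs, not the context, flow back up.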

MinionS delivers 97.9% of the accuracy of remote-only solutions while costing just 17.5% as much!

Figure 3: Cost-Accuracy Tradeoff 

Tuning The Minions

Digging deeper, we found key levers that allow us to navigate the cost-accuracy tradeoff:

  • Model Choice Matters – Currently, local models below 3B parameters are not effective for the Minion/MinionS protocols. By fixing the remote model and varying the local model (across different sizes and generations), we show that the MinionS protocol wouldn’t have been feasible until mid-2024 (until the release of GPT-4-Turbo and Llama-3.1-8B)!
  • Scaling Inference-Time Compute Helps – Techniques like repeated sampling, finer-grained decomposition, and context chunking (see Figure 4) at the execution stage boost quality—but at the cost of longer runtimes on laptops and increased remote usage, as the scaling curves in the figure below illustrate.
  • Sequential Communication Helps – Allowing more communication rounds generally improves accuracy at the cost of additional compute (we sketch these knobs in code below).

Figure 4: Effects of Scaling Local Compute
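
In code, these levers boil down to a handful of knobs on the sketch above. The field names below are our own for illustration, not the repository’s configuration API:

```python
# Illustrative MinionS knobs for navigating the cost-accuracy tradeoff.
# Field names are hypothetical, chosen for this post.
from dataclasses import dataclass

@dataclass
class MinionsConfig:
    local_model: str = "llama3.1:8b"  # below ~3B parameters, the protocols break down
    num_samples: int = 1              # repeated samples per subtask: quality up, local runtime up
    chunk_size: int = 4096            # characters per chunk: smaller = finer-grained decomposition
    max_rounds: int = 2               # communication rounds: accuracy up, remote usage up
```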

To Infinity and Beyond

MinionS is a first step towards establishing a communication protocol between small on-device LMs and frontier LMs. As GPUs become ubiquitous across consumer and edge devices, we need smarter ways to distribute AI workloads—reducing reliance on expensive cloud APIs while unlocking powerful local applications. If you are a practitioner, researcher, or hacker, we’d love to get you involved with Minions:

  • Practitioners: we encourage you to apply MinionS to your own workloads — please see the quickstart in our GitHub repository.
  • Researchers: we see significant opportunities for further research, including improving communication efficiency through better co-design of local and remote models, alternative communication signals beyond natural language, and enhancing inference speed via systems and algorithmic advances.
  • Hackers: we are just getting started with MinionS, and are excited to continue building. If you are excited by our vision, we’d love to see PRs and new projects that expand our community!

Additional Resources

  • How do you use MinionS? Check out our GitHub repository for sample scripts.
  • Want to learn more? Read our paper.

Acknowledgments

A huge thank you to Jeff Morgan, Michael Chiang, Patrick Devine, and the entire Ollama team for their support throughout this release. We’re also grateful to Together AI for generously providing compute resources and credits. Lastly, thanks to the Hazy Lab members – Jon Saad-Falcon, Jerry Liu, Jordan Juravsky, Ben Viggiano, Kelly Buchanan and Michael Zhang – for their valuable feedback on this post and throughout our work.
