The Complete Guide to Needle: How Cactus Distilled Gemini Tool Calling into a 26M Model

Affiliate disclosure: This article contains affiliate links. If you click and purchase through one, we may earn a small commission at no additional cost to you.

AI assistance: Drafted with AI assistance and edited by Auburn AI editorial.







In May 2026, a small team at Cactus did something that sounds implausible: they took the function-calling capabilities that Google spent massive compute budgets perfecting in Gemini and distilled them into a 26-million parameter model that runs on consumer phones, processing input at 6000 tokens per second. The model, called Needle, is now open source under the MIT license and available on GitHub and Hugging Face.

This matters because it exposes a fundamental inefficiency in how we’ve been building AI agents. The conventional wisdom says agentic systems need large models with reasoning capability. Needle suggests otherwise. What surprised us when researching this was how cleanly the architecture separates tool calling from general reasoning—and how much compute we’ve been wasting on the wrong problem.

What Happened: How Needle Distilled Gemini’s Tool Calling Into Something Radically Smaller

Henry and the team at Cactus conducted an investigation into why so few agentic models were optimized for budget phones, smartwatches, and edge devices. Their finding: the entire premise was wrong. Agentic experiences, they discovered, are fundamentally built on tool calling—matching a user query to the right function, extracting argument values, and emitting structured JSON. That’s not reasoning. That’s retrieval and assembly.
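To make "emitting structured JSON" concrete, here is a minimal sketch of what a single-shot tool call could look like and how an app might check it before acting. The tool names, field names, and the `validate_call` helper are hypothetical illustrations, not taken from the Needle repository.

```python
import json

# Hypothetical tool list; names and fields are illustrative only.
TOOLS = [
    {"name": "set_timer", "parameters": {"duration_minutes": "integer"}},
    {"name": "send_message", "parameters": {"recipient": "string", "body": "string"}},
]


def validate_call(raw: str, tools: list[dict]) -> dict:
    """Parse a model's JSON output and check it against the known tools."""
    call = json.loads(raw)
    schema = next((t for t in tools if t["name"] == call.get("name")), None)
    if schema is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    expected = set(schema["parameters"])
    got = set(call.get("arguments", {}))
    if got != expected:
        raise ValueError(f"argument mismatch: expected {sorted(expected)}, got {sorted(got)}")
    return call


# Assumed output shape for the query "set a timer for 10 minutes".
model_output = '{"name": "set_timer", "arguments": {"duration_minutes": 10}}'
call = validate_call(model_output, TOOLS)
```

The point of the sketch is that the model's job ends at a small, checkable JSON object; everything after that is ordinary application code.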

This observation led to a radical architectural choice. Instead of building another transformer with the standard mix of attention layers and feed-forward networks (FFN), they asked: what if we removed the FFN entirely?

Needle is built exclusively from attention layers and gating mechanisms. No feed-forward networks. No parameter waste on memorizing facts that can be provided in the input context. The model was pretrained on 200 billion tokens across 16 TPU v6e processors (27 hours of compute), then post-trained on 2 billion tokens of synthesized function-calling data in just 45 minutes.

The training data itself is interesting. Rather than manually curating tool-calling examples, Cactus used Google’s Gemini to synthesize the entire dataset across 15 tool categories: timers, messaging, navigation, smart home controls, and others. This meta-approach—using a large model to create training data for a small model—reflects a broader trend in efficient AI development.

Performance metrics matter here. Needle achieves 6000 tokens per second during prefill (processing the input) and 1200 tokens per second during decode (generating the output) on consumer devices. To put that in context, that’s fast enough for real-time interaction on a mid-range Android phone or an Apple Watch. The model outperforms Google’s own FunctionGemma-270M, Qwen-0.6B, and several other specialized function-calling baselines on single-shot tool invocation tasks.
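As a rough back-of-envelope check on those throughput figures, the sketch below converts them into end-to-end latency for one request. The two rates come from the article; the prompt and output sizes are assumptions chosen for illustration, and real latency also depends on the device and runtime overhead.

```python
PREFILL_TOKENS_PER_SEC = 6000  # reported prefill rate
DECODE_TOKENS_PER_SEC = 1200   # reported decode rate


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate latency as prefill time plus decode time, in milliseconds."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode_s = output_tokens / DECODE_TOKENS_PER_SEC
    return (prefill_s + decode_s) * 1000.0


# Assumed sizes: 300 prompt tokens (query plus tool definitions)
# and 24 output tokens (one short JSON tool call).
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=24)
print(f"{latency:.0f} ms")  # 50 ms prefill + 20 ms decode = 70 ms
```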

The team is transparent about scope limitations. Needle excels at single-shot function calling—one query, one tool invocation, done. Models like FunctionGemma and Qwen maintain advantages in multi-turn conversations and broader reasoning tasks. Needle is purpose-built, not general-purpose. That’s precisely the point.

Why This Matters: The Economics of Edge AI Just Shifted

The implications reach beyond technical novelty. Needle makes it practical to put tool-calling agents on the large population of low-cost phones, watches, and edge devices that currently lack any agentic capability.

Consider the practical constraints. A 26-million parameter model requires roughly 52 megabytes of storage in float16 precision. A smartphone with 4GB of RAM can load this comfortably alongside a browser and messaging app. A smartwatch with 512MB can run it. The inference speed—6000 tokens per second on prefill—means a user query gets processed in tens of milliseconds, not seconds. This changes what’s possible on-device without cloud calls.
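The storage figure follows directly from parameter count times bytes per parameter. The sketch below reproduces it; the float32 and int8 rows are added for comparison and are assumptions, since the article only mentions float16. It counts weights only and ignores activations and any attention cache.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_mb(num_params: int, dtype: str) -> float:
    """Return weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


NEEDLE_PARAMS = 26_000_000
sizes = {dtype: weight_size_mb(NEEDLE_PARAMS, dtype) for dtype in BYTES_PER_PARAM}
print(sizes)  # {'float32': 104.0, 'float16': 52.0, 'int8': 26.0}
```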

The energy profile matters too. Smaller models consume less power. On a phone battery, the difference between a 26M and a 7B parameter model isn’t marginal—it’s the difference between viable and impractical. For wearables, it’s the difference between usable and not.

From an industry perspective, this challenges the scaling laws narrative that has dominated AI discourse. The conventional wisdom says bigger is always better—more parameters, more data, more compute. Needle suggests that for specific, well-defined tasks like tool calling, the relationship breaks down. You can be smarter by being smaller, if you’re purpose-built.

There’s also a privacy angle. If function calling can happen entirely on-device without cloud connectivity, that’s a meaningful privacy improvement over systems that require sending every query to a remote API. Users get faster responses and data stays local.

The open-source release under MIT license removes licensing friction for commercial use. Companies can integrate Needle into products without negotiating with Google or paying API fees. That has real business implications for startups and established companies alike.

How It Works: The Architecture Behind Needle’s Efficiency

The technical innovation is straightforward once you understand the premise. Needle uses what the authors call Simple Attention Networks—the entire model is attention and gating, no MLPs (multi-layer perceptrons, the feed-forward networks in standard transformers).

Traditional transformers alternate between attention layers (which allow the model to look at different parts of the input) and FFN layers (which apply non-linear transformations to each token independently). In a standard block with a 4x hidden expansion, the FFN holds roughly two-thirds (about 66%) of the block’s weights, embeddings aside. In a 26M model, removing the FFNs entirely frees millions of parameters while keeping the attention machinery that matters for tool calling.
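The two-thirds figure can be checked with simple counting. This sketch assumes a textbook transformer block: four square attention projections (query, key, value, output) and a two-layer FFN with a 4x hidden expansion. Biases, normalization layers, and embeddings are ignored, and `d_model=512` is an arbitrary illustrative width, not Needle's.

```python
def block_param_counts(d_model: int, ffn_multiplier: int = 4) -> dict:
    """Count weight parameters in one textbook transformer block."""
    attention = 4 * d_model * d_model               # Q, K, V, and output projections
    ffn = 2 * d_model * (ffn_multiplier * d_model)  # up-projection and down-projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }


counts = block_param_counts(d_model=512)
print(f"{counts['ffn_share']:.1%}")  # 66.7%
```

The share is independent of the model width under these assumptions: 8 parts FFN to 4 parts attention.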

Why does this work? Cross-attention is the right primitive for function calling. When the model sees a query like “set a timer for 10 minutes,” it needs to match that to a timer tool and extract the duration argument. That’s fundamentally a matching and retrieval problem: the query attends to the available tools, picks the right one, and extracts values. Attention handles this naturally. FFN layers, which are commonly credited with storing memorized facts, add little when everything the model needs is already in the input.
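The matching intuition can be illustrated with a toy version of scaled dot-product attention. This is not Needle's implementation: the bag-of-words `embed` function stands in for learned embeddings, and the tool names and descriptions are hypothetical. It only shows how a query vector scored against tool "keys" concentrates weight on the best match.

```python
import math


def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def softmax(scores: list[float]) -> list[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product attention weights of one query over several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical tools and descriptions.
tools = {
    "set_timer": "set a timer for a duration",
    "send_message": "send a message to a contact",
    "navigate": "navigate to a destination",
}
vocab = sorted({w for desc in tools.values() for w in desc.split()})
query_vec = embed("set a timer for 10 minutes", vocab)
key_vecs = [embed(desc, vocab) for desc in tools.values()]

weights = attend(query_vec, key_vecs)
best_tool = max(zip(weights, tools), key=lambda pair: pair[0])[1]
print(best_tool)  # set_timer
```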

The gating mechanism (likely inspired by recent work on gated attention variants) allows the model to selectively route information. Early layers might focus on understanding the query. Later layers might focus on matching to tools and formatting output. The gating learns which information is relevant for each task.

This architecture generalizes beyond tool calling. The team notes that the “no FFN” finding applies to any task where the model has access to external structured knowledge: retrieval-augmented generation (RAG), knowledge base lookup, structured data extraction. The model doesn’t need to memorize facts in its weights if those facts are provided in the input context. That’s a significant insight with implications far beyond Needle itself.

Training efficiency also reflects the architectural choices. Post-training on 2 billion tokens of synthesized function-calling data took 45 minutes on modern hardware. That’s dramatically faster than training large models, which typically requires weeks or months. The synthesis process used Gemini to generate diverse tool-calling scenarios, ensuring coverage across the 15 tool categories.

Expert Reactions and Industry Context

The release has resonated within the open-source AI community, particularly among developers focused on mobile and edge deployment. The GitHub repository has attracted attention from researchers exploring efficient agentic systems and from practitioners building real products on constrained devices.

What’s notable is how Needle fits into a broader pattern. Over the past 18 months, we’ve seen increasing focus on task-specific models rather than general-purpose giants. Mistral’s move toward smaller, specialized models. Apple’s investment in on-device ML. Meta’s release of Llama variants optimized for mobile. Needle accelerates this trend, but with a specific insight: for tool calling, you can be radically smaller.

The comparison benchmarks matter. Needle outperforms FunctionGemma-270M (10x larger), Qwen-0.6B (23x larger), Granite-350M (13x larger), and LFM2.5-350M (13x larger) on single-shot function calling. Those are meaningful wins. The caveats are important too—those larger models excel in conversational multi-turn settings where Needle’s single-shot design is a limitation. There’s no free lunch, just better tradeoffs for specific use cases.
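The size multiples quoted above can be reproduced from the nominal parameter counts. The sketch assumes each baseline's size is the figure in its name (for example, 0.6B read as 600M), which may differ slightly from exact parameter counts.

```python
NEEDLE_PARAMS_M = 26  # millions of parameters

# Nominal sizes in millions, read off the model names (an assumption).
baselines_m = {
    "FunctionGemma-270M": 270,
    "Qwen-0.6B": 600,
    "Granite-350M": 350,
    "LFM2.5-350M": 350,
}

ratios = {name: round(size / NEEDLE_PARAMS_M, 1) for name, size in baselines_m.items()}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio}x Needle's size")
```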

Cactus frames Needle as part of its broader work on the Cactus inference engine, which is built from scratch for mobile, wearables, and custom hardware. The engine is the other half of the equation: a small model is only useful if you have an efficient runtime to execute it. Cactus appears to be building that stack end-to-end.

What Comes Next: The Implications for On-Device AI

The immediate next steps are clear from the team’s roadmap. They’re encouraging developers to test Needle on custom tool sets via the provided playground and finetune the model for specific use cases. The MIT license and open weights make this practical.

Longer term, there are several interesting directions. The “no FFN” architecture might extend to other specialized tasks beyond tool calling. The team mentions experimental results coming for RAG, suggesting they’re already exploring this. If that pans out, you could see a family of efficient models optimized for different agentic subtasks.

There’s also the question of scaling. Needle is 26M parameters. What happens at 50M, 100M, or 200M with the same architectural principles? Does the efficiency advantage hold? Does performance improve linearly or are there diminishing returns? These are open questions the community will likely explore.

The practical deployment story matters too. As more developers integrate Needle into products, real-world performance data will emerge. How does it perform on diverse tool sets? How well does it generalize to tools it wasn’t trained on? Does finetuning on custom tools significantly improve performance? These questions will determine whether Needle becomes a standard component in mobile AI stacks or remains a specialized tool for specific use cases.

There’s also potential for competitive response. Google has the resources to build even more efficient function-calling models. Apple might integrate similar capabilities into iOS. The open-source release puts pressure on the entire industry to be more thoughtful about model efficiency and task-specificity rather than defaulting to larger, more general models.

Conclusion

Needle demonstrates that the path to efficient AI isn’t always about scaling down general-purpose models. Sometimes it’s about rethinking the problem entirely. By recognizing that tool calling is fundamentally different from reasoning, Cactus built something that shouldn’t exist according to conventional wisdom: a 26-million parameter model that outperforms much larger specialized baselines on single-shot function calling.

The open-source release removes barriers to adoption. Developers can now integrate agentic capabilities into mobile apps, wearables, and edge devices without cloud dependencies or API costs. The architecture suggests a broader principle: for any task with access to external structured knowledge, you might not need the general reasoning capacity of large models. That principle will likely shape how AI systems are built over the next few years.

The real test comes next—whether developers adopt Needle, finetune it for real-world use cases, and prove that this approach scales to production systems. If they do, this becomes the template for efficient agentic AI.

– Auburn AI editorial



