Same Token Budget, Nearly Double the Accuracy: Anthropic’s Counterintuitive Agent Experiment
Anthropic's platform team found that under a fixed token budget, assigning different roles (execution, suggestion, scoring, reflection) to tokens can boost accuracy from 42% to 75%. This isn't magic—it's engineering.
Anthropic's platform team recently put forward a counterintuitive argument: not all tokens are created equal.
The usual assumption is that tokens are a single lever: spend more of them and results improve. The team asked a different question: what if different tokens are assigned specific roles?
So they split the task up. Some tokens are responsible for execution, some provide suggestions to the executor, some score outputs based on predefined criteria, and another set does "reflection"—reviewing past runs, documenting lessons learned for future use.
The key test: keep the token budget fixed across all strategies.
If all tokens were interchangeable, every strategy should get roughly the same score. But that's not what happened. On a financial analysis benchmark, the pure execution strategy reached the correct answer 42% of the time—while the smarter, role-split strategy hit 75% at its best.
The cost gap is just as large: to brute-force its way to a perfect answer, the pure execution strategy consumes around 1.8 million tokens, while the "suggest + score" strategy needs only a tiny fraction of that. The tokens are the same; the division of labor is simply far better.
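The fixed-budget setup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Anthropic's implementation: the `Strategy` class, the role names as identifiers, the 100,000-token budget, and the 60/20/20 split are all hypothetical and do not come from the talk. The point it shows is that every strategy draws on exactly the same total, and only the division among roles changes.

```python
from dataclasses import dataclass

# Role names follow the article; using them as dictionary keys is an assumption.
ROLES = ("execute", "suggest", "score", "reflect")


@dataclass(frozen=True)
class Strategy:
    """A hypothetical way of dividing one fixed token budget among roles."""
    name: str
    percents: dict[str, int]  # role -> integer percentage; must sum to 100

    def allocate(self, budget: int) -> dict[str, int]:
        """Return tokens per role; the total always equals `budget`."""
        unknown = set(self.percents) - set(ROLES)
        if unknown:
            raise ValueError(f"unknown roles: {sorted(unknown)}")
        if any(p < 0 for p in self.percents.values()):
            raise ValueError("percentages must be non-negative")
        if sum(self.percents.values()) != 100:
            raise ValueError("percentages must sum to 100")

        # Integer arithmetic avoids floating-point drift in the totals.
        alloc = {role: budget * self.percents.get(role, 0) // 100 for role in ROLES}
        # Any rounding remainder goes to the executor so no token is lost.
        alloc["execute"] += budget - sum(alloc.values())
        return alloc


if __name__ == "__main__":
    BUDGET = 100_000  # illustrative value, not a figure from the talk

    strategies = [
        Strategy("pure execution", {"execute": 100}),
        # The 60/20/20 split is invented for illustration only.
        Strategy("suggest + score", {"execute": 60, "suggest": 20, "score": 20}),
    ]
    for strategy in strategies:
        allocation = strategy.allocate(BUDGET)
        # Same total for every strategy; only the split differs.
        print(strategy.name, allocation, "total:", sum(allocation.values()))
```

Holding the total constant is what makes the comparison meaningful: if accuracy differs between two strategies here, the difference cannot be explained by one of them simply spending more.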
Anthropic's talk is around 15 minutes long and free to watch. Core takeaway: **How you spend your tokens matters far more than how many tokens you spend.**
---
This line of thinking is already showing up in products. Anthropic just launched Claude Tag: when you @ it in a Slack channel, it works like a team member, taking initiative and running autonomously for days.
It's not the old "open the app and ask a question" workflow. You add it to your channel, and it jumps in on its own, follows up on stalled threads, and permanently remembers everything you've ever told it.
One engineer has had a single Claude Tag session running continuously for a month. It checks data every day, automatically submits PRs to fix bugs, and outputs a daily report every morning.
It's built for multi-person collaboration: the whole team works with the same Claude instance, no more copying and pasting outputs between team members. In Anthropic's own product team, around 65% of all PRs are now written by Claude Tag—including most of the code used to build Tag itself.
Andrej Karpathy calls this the third major shift in LLM user experience: it's no longer "you go talk to AI"—it's "the AI is already in the room with you."
---
Dario Amodei, CEO of Anthropic, mentioned in another talk that across many of the company's teams, around 90% of all code is now written by AI.
Everyone assumes this means 90% of engineers will be laid off. But he says the opposite is true.
Following the principle of comparative advantage, engineers no longer spend time writing routine code—they shift their focus to the hardest 10% of work: editing, oversight, and decision-making.
Taken together, these three points make Anthropic's core logic clear: split tokens into specialized roles, embed AI directly into existing workflows, and help engineers level up their work. It's restructuring, not replacement.
Published: 2026-07-05 13:25