Key takeaways:
- Token usage is the lines-of-code metric of the AI era – easy to measure, easy to game, and disconnected from actual productivity.
- The best frameworks track cognitive delegation, not consumption.
- Self-reporting works, but only where trust exists.
Some engineering organizations are measuring AI output via tokens burned, some compare engineers to executive chefs, and some rely on self-reporting.
When Meta’s internal tokenmaxxing leaderboard – ranking engineers by how many AI tokens they consumed – became public knowledge, the engineering community reacted quickly. The leaderboard has since been taken down, but Meta is not the only place where high token usage is treated as a badge of honor.
Across large organizations, there is enormous pressure to prove that the millions being spent on AI tooling are paying off.
The easiest metric to game
Tokens burned has become the easiest number to point to. It’s objective, it’s automated, it scales across thousands of engineers, and it gives leadership a dashboard. It’s easy to measure. It’s also easy to game.
“Measuring AI adoption and productivity gains is hard, and especially if you have to do it at an individual level. Token usage definitely isn’t the right metric. You could just use tokens to run your OpenClaw!” says Ankit Jain, founder of the Hangar, a community of senior engineers and engineering leaders focused on developer experience and solving productivity challenges at scale.
Tokenmaxxing takes us back to the era before DevOps Research and Assessment (DORA), when organizations measured lines of code, he adds. DORA metrics are a set of five software delivery performance metrics that provide an effective way of measuring the outcomes of the software delivery process.
However, DORA metrics were never designed to evaluate individual developer output. They work as a set, not in isolation, Jain says: game one metric and the others break down.
When organizational trust is low, managers reach for data to justify headcount and budget decisions, and individual productivity measurement follows. That’s how DORA gets misapplied as a personal scorecard, and the same dynamic is now producing tokenmaxxing leaderboards.
“Organizations need to figure out how much outcome AI tools are driving, and that’s a hard problem to measure,” Jain says. The solution, he argues, is combined metrics: “We have to come up with a set of metrics versus just one. Tokens used, though, has to be one of them.”
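Jain doesn’t prescribe a formula, but the cross-checking he describes can be sketched. The following is illustrative only: the metric names, thresholds, and the specific pairing of token growth with delivery outcomes are assumptions for demonstration, not anything published by Jain or the Hangar.

```python
# Illustrative sketch of a combined scorecard in the spirit of Jain's
# "set of metrics versus just one." All names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class TeamMetrics:
    tokens_used: int            # raw consumption: an adoption signal, not productivity
    prs_merged: int             # delivery volume over the same period
    change_failure_rate: float  # DORA-style outcome metric (0.0 to 1.0)

def ai_signal(current: TeamMetrics, baseline: TeamMetrics) -> str:
    """Cross-check token growth against delivery outcomes.

    The point of combining metrics is that gaming one breaks the others:
    token usage can triple, but if delivery outcomes stay flat, the tokens
    likely bought nothing.
    """
    token_growth = current.tokens_used / max(baseline.tokens_used, 1)
    delivery_growth = current.prs_merged / max(baseline.prs_merged, 1)
    quality_held = current.change_failure_rate <= baseline.change_failure_rate

    if token_growth > 2.0 and delivery_growth < 1.1:
        return "tokenmaxxing suspected: consumption up, outcomes flat"
    if token_growth > 1.2 and delivery_growth > 1.2 and quality_held:
        return "plausible gain: consumption and outcomes moving together"
    return "inconclusive: keep watching the combined metrics"
```

The design point is Jain’s: inflating one input (tokens) without moving the others flags itself.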
From line cook to executive chef: a skill-level-based framework
One of the more interesting attempts to move beyond token counting comes from a staff engineer at a FAANG company we spoke to, who has been building an AI adoption measurement framework at their organization.
The framework is inspired by Steve Yegge’s ‘executive chef’ model that compares software engineers to chefs in a restaurant kitchen. AI agents are engineers’ sous chefs, line cooks, and prep cooks. Just like executive chefs don’t do all the chopping themselves but decide what goes on the menu, develop the recipes, and taste everything before it’s served, engineers orchestrate agents and own the outcome.
They have adapted Yegge’s original nine levels to four, ranging from basic interactive tool use to orchestrating multiple autonomous agents. The levels do not measure which tools engineers use or how many tokens they burn, but rather the degree of cognitive work being delegated to AI over time, and how effectively.
“A developer at level one is using AI for quick one-shot queries. At level four, they’re writing detailed specifications, orchestrating multiple agents, and shifting quality assurance left toward spec review rather than code review,” they say.
The framework also revealed a counterintuitive proxy signal: as engineers progress through the levels, they file fewer bugs because they resolve issues inline rather than queue them. A declining bug backlog growth rate becomes a lagging indicator of genuine AI maturity – not something you’d normally think to instrument for, but meaningful once you know to look.
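The staff engineer didn’t share an implementation, but the indicator they describe is simple to compute from issue-tracker exports. A minimal sketch, assuming weekly counts of bugs filed and closed (the data shape and field names are hypothetical):

```python
# Minimal sketch of the lagging indicator described above: week-over-week
# growth rate of the bug backlog. Filing fewer bugs (because issues get
# fixed inline) shows up as a declining, then negative, growth rate.

def backlog_growth_rates(weekly: list[dict], initial_backlog: int = 100) -> list[float]:
    """weekly: [{"filed": int, "closed": int}, ...] in chronological order.
    Returns the net backlog change per week, normalized by the running backlog."""
    backlog = initial_backlog
    rates = []
    for week in weekly:
        net = week["filed"] - week["closed"]
        rates.append(net / max(backlog, 1))
        backlog += net
    return rates

# Example: filing volume drops as engineers resolve issues inline.
history = [
    {"filed": 40, "closed": 25},
    {"filed": 35, "closed": 28},
    {"filed": 22, "closed": 30},  # backlog starts shrinking
]
print(backlog_growth_rates(history))  # rates trend down toward negative
```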
The data also surfaced a bigger problem: the level progression is not linear or obvious. Engineers at level one can’t easily imagine level four. There is a real adoption challenge that token dashboards detect but can’t diagnose: engineer resistance.
“There is a genuine pushback of ‘I don’t want my job to be an orchestrator of agents,’ which is an identity concern, not a capability one, and a much harder thing to measure or address. Any honest framework for AI adoption has to account for where an organization sits on that curve,” they say.
The opposite approach: self-reporting and trust
Emily Nakashima, SVP of engineering at Honeycomb, says there is a lot of fear among people at software companies (not just among engineers) when the question of measuring AI impact and productivity is raised. People worry about what that means for their jobs.
“We did a top-down founder memo last summer, saying that we really believe in this new technology, we really want people to spend time experimenting and learning about it, and that we should all try to 2x our impact with AI over the next year. The first question we got from engineers was ‘how are we going to measure this?’” she says.
Rather than building formal frameworks, Nakashima says they deliberately de-emphasized measurement, relying instead on self-reporting.
“I really worry about companies trying to 2x or 3x their token spend, because there are ways engineers can do that that return no value to the company. We actually get a lot of value out of self-reporting. For that to work well, you have to have a relatively high-trust organization. What engineers on my team give back in terms of self-reporting actually aligns pretty well with what’s seen in their work. When it’s working, you can see it, and these measurement questions go away a little bit.”
Going forward, Nakashima says she’d like an individual-level self-report paired with a manager-level self-report for the team on how much they have increased their impact with AI; she suspects that combination would be among the most accurate and valuable measurements available. Honeycomb’s self-reporting approach, though, requires a level of organizational trust that many large companies don’t have.
Hard to measure, but it has to be measured
When trust is low, managers lean on hard numbers, and that’s how we end up with leaderboards. The alternative of not measuring at all isn’t viable either. Organizations are spending millions on AI tooling and need to know whether it’s working. Engineering leaders need signals to know where to invest in training, which teams need support, and where the productivity gains are happening.
Before organizations can measure whether AI is making engineers more effective and where the skill gaps are, managers need to know whether they’re using it at all. Token usage is a good proxy for AI adoption, but it’s a terrible metric for AI productivity.
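In code terms, the adoption question is a threshold check, not a leaderboard. A minimal sketch, where the data shape and the usage floor are assumptions:

```python
# Sketch of tokens-as-adoption-proxy: the useful question is who is using
# the tools at all, not who burns the most. Data shape is hypothetical.

def adoption_rate(tokens_by_engineer: dict[str, int], floor: int = 10_000) -> float:
    """Share of engineers whose monthly token usage clears a minimal floor.
    The floor filters out one-off experiments; its value is an assumption."""
    active = sum(1 for t in tokens_by_engineer.values() if t >= floor)
    return active / max(len(tokens_by_engineer), 1)

usage = {"ana": 2_000_000, "ben": 0, "cho": 45_000, "dev": 3_000}
print(f"{adoption_rate(usage):.0%} of engineers are active AI users")  # 50%
```

Note that ana’s 2 million tokens and cho’s 45,000 count identically here: as adoption, the signal is binary, which is exactly why it says nothing about productivity.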
Angie Jones, former VP of engineering, AI tools, and enablement at Block, confirms that measuring token usage was helpful early in the company’s journey, when she needed to track adoption.
“After adoption was clear, I threw it out as it did nothing to measure developer productivity. I wouldn’t be surprised if we see the opposite trend next year, aiming for efficient usage of tokens as opposed to celebrating burning them at expensive rates.”

The tokenmaxxing backlash
The backlash to Meta’s tokenmaxxing leaderboard shows the industry already knows that high token usage is the wrong thing to celebrate, especially as AI tooling costs continue to grow.
However, there is no consensus yet on what to measure instead. Jain suggests combining token usage with lines of code shipped and says some AI tools already provide data points around how much code was accepted. “Although ‘all the code was accepted’ is a very fuzzy measurement,” Jain admits.
That fuzziness is the state of things. The tooling to connect token spend to shipping outcomes doesn’t fully exist yet. The signals are noisy, and every framework requires tradeoffs between rigor and trust.
If Jones is right that the next phase will be about efficient token usage rather than maximum token usage, the leaderboards may not disappear – they’ll just measure something worth competing over.