Is Microsoft’s EngThrive framework immune to Goodhart’s Law?

Making your metrics more gameable.
May 21, 2026

Key takeaways:

  • Microsoft developed a productivity measurement system organized around Speed, Ease, Quality, and Thriving.
  • Instead of preventing metric manipulation, EngThrive relies on “gaming alignment”: choosing metrics where cheating actually produces beneficial long-term outcomes.
  • To avoid breaking the metrics under Goodhart’s Law, the system assesses overall team health rather than individual developer performance.

A new developer productivity system claims to be game proof by design. A leading researcher on metric failure isn’t so sure.

The British economist Charles Goodhart has reportedly never written a line of code in his life. Yet the 89-year-old’s most famous observation – that when a measure becomes a target, it stops being a good measure – has stalked engineering managers for decades.

Ask people to count up the lines of code they commit, and developers will pad them out. Check pull request volume, and the codebase fills up with trivial commits. Set token-based benchmarks in the new AI-assisted coding world, and tokenmaxxing becomes a thing.

But Microsoft’s research arm believes it has found a metric of engineering productivity that could minimize, if not entirely skirt around, Goodhart’s Law. 

Distinguished scientist Brian Houck and his co-authors unveiled EngThrive, a measurement and improvement system organized around three dimensions (Speed, Ease and Quality), with a fourth, Thriving, acting as a key guardrail. 

  • Speed captures how quickly developers can convert intent into features, and is measured in units of “elapsed calendar time” for things like pull request completion time. 
  • Ease tries to calculate the degree to which developers can work without unnecessary friction or interruption and is measured through “developer time” and surveys on perceived ease of delivery.
  • Quality assesses whether the engineering work produces long-standing and reliable outcomes, and is measured by the frequency and cost of quality-related disruptions.

According to the paper, the system has been deployed across “tens of thousands” of Microsoft developers and is now being used to evaluate everything from AI coding assistants to office HVAC. There’s hope that the framework is flexible enough to work outside of Microsoft.

Unlike previous attempts to capture developer productivity, EngThrive doesn’t try to make its metrics immune to gaming. It tries to make any gaming ultimately useful.

“What we did is try to choose metrics where it’s, like, ‘Great, try to cheat the system. That’s a good thing’,” says Houck, who recently left Microsoft to join Atlassian as a distinguished scientist. The principle, which the paper calls “gaming alignment”, is to pick metrics where the act of fiddling the numbers produces the outcome you actually wanted.

The first-PR experiment

The most striking example concerns Time-to-First-PR – the time between a new hire joining and merging their first commit. On paper, it’s a textbook bad metric, trivially gameable by assigning a one-line code change on day one.

But that’s no bad thing under EngThrive. Microsoft managers overseeing around 4,000 developers were told to assign trivial first pull requests like fixing a typo in a code comment or flipping a test case variable, provided the work sat inside the team’s real workflow, rather than a sandbox environment.

New hires reached their first pull request roughly 30% faster. Alongside that, those in the experiment went on to commit 23% more pull requests over their first year than the control group.

“The act of gaming led to positive long-term outcomes,” says Houck. When he interviewed participants afterwards, they explained why: the first pull request, however small, forced them to set up their development environment, learn the team’s review conventions and master the language of daily stand-ups. “Up until I completed my first code check-in, I couldn’t participate in daily stand-ups because I didn’t know what anyone was talking about,” Houck recalls one new hire telling him.
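For readers who want to track something similar, here is a minimal sketch of how a team-level Time-to-First-PR figure could be computed. It is not taken from the EngThrive paper: the record layout, the sample dates and the choice of a median are all assumptions made for illustration.

```python
from datetime import date
from statistics import median

# Hypothetical records: (start date, date the first PR was merged).
# These values are invented for illustration, not Microsoft data.
new_hires = [
    (date(2026, 1, 5), date(2026, 1, 9)),
    (date(2026, 1, 5), date(2026, 1, 26)),
    (date(2026, 2, 2), date(2026, 2, 4)),
]


def time_to_first_pr_days(start: date, first_merge: date) -> int:
    """Elapsed calendar days between joining and the first merged PR."""
    return (first_merge - start).days


def median_time_to_first_pr(records) -> float:
    """Team-level summary; the median is an assumed choice of aggregate."""
    return median(time_to_first_pr_days(s, m) for s, m in records)


print(median_time_to_first_pr(new_hires))  # 4
```

Reporting one aggregate per team, rather than a figure per person, keeps the number in line with the article's point that these metrics are meant for teams and leaders, not individual reviews.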

Good-hearted disagreement

“People often misunderstand what Goodhart’s Law is saying,” says David Manheim, founder and head of research and policy at the Association for Long Term Existence and Resilience (ALTER), and a visiting lecturer at the Technion – Israel Institute of Technology, whose academic work on the failure modes of metric systems is widely cited in the field. 

“Optimizing your metrics optimizes your metrics, not your goals. To the extent that your goals and your metrics are different, that puts pressure on the metrics, not the goals.”

That doesn’t necessarily mean optimization harms outcomes. “It’s often the case that optimizing a metric that is moderately correlated with your goal actually helps a bunch, if you don’t do it too much,” Manheim says. The trouble starts when organizations push too hard.

Raw PR throughput is one metric that historically has been a classic source of gaming. “What’s the easiest way to game PR throughput?” asks Houck. “Break up your code check-ins into a bunch of small little PRs. Turns out that’s a good thing. They’re easier to review. They’re easier to test.”

Manheim is impressed, though not entirely convinced. “Any single metric can be gamed. It’s true any group of metrics can also be gamed,” he says. “The advantage of what they’re doing isn’t that they didn’t fall into the most stupidly obvious trap. It’s that they actually paid attention to features developers should care about.”

That includes things most engineering organizations don’t measure ordinarily. EngThrive includes a “Bad Developer Days” composite, which counts days lost to context switching, build failures, incident response and compliance overhead, alongside a more conventional Net Satisfaction reading.

Some 52% of all developer days at Microsoft qualified as “bad”, according to the study, and developers experiencing three or more bad days a week were three times more likely to quit. “They measure net satisfaction, which most places don’t,” says Manheim, crediting the work.
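As a rough illustration of how a composite like this might be tallied, the sketch below flags a day as “bad” when any one of the four disruption types named above occurred. That rule, the `DevDay` type and the sample week are assumptions; the article does not describe the thresholds the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class DevDay:
    """One developer-day with hypothetical disruption flags."""
    context_switching: bool
    build_failure: bool
    incident_response: bool
    compliance_overhead: bool

    def is_bad(self) -> bool:
        # Assumption: any single disruption marks the whole day as bad.
        return any((
            self.context_switching,
            self.build_failure,
            self.incident_response,
            self.compliance_overhead,
        ))


def bad_day_share(days) -> float:
    """Fraction of developer-days flagged as bad (0.0 for no data)."""
    if not days:
        return 0.0
    return sum(d.is_bad() for d in days) / len(days)


# Invented sample: two of four days include at least one disruption.
week = [
    DevDay(True, False, False, False),
    DevDay(False, False, False, False),
    DevDay(False, True, True, False),
    DevDay(False, False, False, False),
]

print(bad_day_share(week))  # 0.5
```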

The pressure problem

EngThrive is designed to assess organizations and teams, but not individual engineers. “We never use these metrics for assessing individual-level performance,” says Houck. “It isn’t, and should never be, about holding individual developers accountable. This is really about holding engineering leaders accountable.” 

That’s something Manheim says is laudable about the way EngThrive is designed. “Lines of code get turned into a performance review item, and sometimes a reason for people not to get promoted or not get bonuses – and suddenly there’s a reason for people to use lots of line breaks.”

The lesson, he says, is to keep metrics on a dashboard rather than linking them to pay decisions. “Using this as a dashboard to pay attention to what’s happening is not putting tremendous optimization pressure,” he says. “If your metric is at all fragile, pushing too hard can break it.”

Because of that, Manheim reckons that EngThrive isn’t totally immune to Goodhart’s Law. But he considers it a laudable start, and a way for engineering managers to improve their metrics with less risk of gaming than other approaches carry.

“Do better things exist? Obviously, there are things that could be improved,” says Manheim. “But you’re not going to build the best system right out of the gate. This is after decades of iterating on, ‘Here are all the ways things go wrong.’”
