AI models can’t understand code. Does that matter?

New research highlights how little large language models understand about the code they are churning out. Should we care?
April 25, 2024


As developers find new ways to get more done by using generative AI-powered coding assistants, a recent study has found that large language models (LLMs) are more parrot than 10x developer when it comes to the relatively simple task of summarizing code.

“I’ve talked to a lot of people in the area and their instinct is that you can ask these language models to do any task for you, and they’ll do it,” says Rajarshi Haldar, a researcher at the University of Illinois Urbana-Champaign and co-author of the paper. “It’s great how often they do work, but you also have to know when they don’t work. And you also have to apply safeguards to make sure that they stay on course and actually get the results you want.”

A June 2023 McKinsey survey found that code documentation tasks took 45-50% less time when devs had a helping hand. Generating new code was 35-45% faster with generative AI’s help, while refactoring code could be done 20-30% quicker thanks to tools like GitHub Copilot.

The problem is that those productivity gains aren’t distributed equally. McKinsey found that developers with less than a year’s experience in the industry saw less than a 10% benefit from using AI tools, and some tasks took junior devs even longer with generative AI than if they had written the code themselves.

It seems that understanding the code you are working with unlocks these productivity gains, but using these tools without that baseline knowledge could be counterproductive, or even dangerous.

Putting LLMs to the test

Haldar and his co-author, Julia Hockenmaier, a professor in natural language processing at the University of Illinois Urbana-Champaign, wanted to test how well LLMs – including PaLM 2 and Llama 2 – understand the semantic relationship between natural language and code. They chose to investigate whether the models relied on similarities between the tokens (the chunks of text that models are trained on, which can be parts of words or whole words) in their training data and the code they were asked to summarize – essentially, were the models just parroting the code already in their training data?

“We’re really interested in understanding how much these models actually understand,” says Hockenmaier. To do that, the researchers altered code examples by removing or changing function names and the structure of the code, then checked whether the LLM was still able to interpret and summarize the code. “What our results indicate is that these models don’t really understand the underlying logic of the code,” she says, meaning they’re unable to solve real issues.
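
To make the kind of perturbation the researchers describe concrete, here is a minimal Python sketch of the idea: rename every identifier in a snippet to an opaque token, then ask a model to summarize the result, so that any correct summary has to come from the code’s structure rather than its names. The example snippet and renaming scheme are illustrative assumptions, not taken from the paper.

```python
import ast

# A simplified, hypothetical version of the perturbation described above:
# strip meaningful identifiers out of a snippet so that a model summarizing
# it cannot lean on names like "average" or "total" and must follow the
# code's actual logic instead.

ORIGINAL = """
def average(numbers):
    total = sum(numbers)
    return total / len(numbers)
"""

class RenameIdentifiers(ast.NodeTransformer):
    """Replace every user-defined function, argument, and variable name
    with an opaque token such as v0, v1, ..."""

    def __init__(self):
        self.mapping = {}

    def _anon(self, name):
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_FunctionDef(self, node):
        node.name = self._anon(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._anon(node.arg)
        return node

    def visit_Name(self, node):
        # Leave builtins like sum and len untouched so the code still runs.
        if node.id not in {"sum", "len"}:
            node.id = self._anon(node.id)
        return node

tree = RenameIdentifiers().visit(ast.parse(ORIGINAL))
print(ast.unparse(tree))
# def v0(v1):
#     v2 = sum(v1)
#     return v2 / len(v1)
#
# A model that truly follows the logic should still summarize this as
# "computes the mean of a list"; a model that parrots surface tokens
# will struggle once the telling names are gone.
```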

The test focused on whether LLMs could understand and summarize code given to them. “‘Explain what the code does’ versus ‘write code for this problem’ are almost but not quite inverse problems,” says Eerke Boiten, professor in cybersecurity at De Montfort University. “In the second case, the description of the problem would tend to be a bit more abstract than the description of its solution.”

Stay vigilant

But if LLMs did exhibit the same issues when writing code, that might be a concern – not least because of the old maxim about AI outputs: “garbage in, garbage out”. LLMs are trained on giant trawls of content from the web, which can include platforms like Stack Overflow, where coders often ask for advice on how to fix or patch up non-working code.

If an AI system cannot recognize that some snippets of code it encounters are incorrect and don’t work – and discount them accordingly – but instead folds them into the knowledge built from its training data, that could prove disastrous as the model mindlessly reproduces bad code.

Large language models have never been claimed to actually understand what they see, although that may change with the mooted release of new models rumored to have some form of understanding.

Yet LLMs are still seen as a powerful potential tool for developers to deploy in order to speed up their work – and increasingly computer science students are graduating after being taught how to work with LLMs.

That is fine if those developers are aware of the systems’ drawbacks and remain vigilant for any errors that occur. For the time being, developers ought to be wary of the tools they’re using – and skeptical of the output those tools produce. “Models are so large, and have been trained on so much data, that the output is always going to look plausible,” Hockenmaier says. “But whether it’s actually correct, it’s probably going to depend very much on the complexity of the function.”