I've disabled all LLM-based AI Assistants/Copilots/whatever-you-call-'ems in my IDE.

Take a moment to let the shock, horror, and betrayal pass through you (or, potentially, the giddy "OMG I'm not alone!" squeal burbling up from deep inside), because I'd like to expound for a bit about why I've done this. I know the title is a bit of a giveaway, but no, it's not because I'm some cranky old man who can't stand that new-fangled ding-dang technology. Well, I mean, I am old and cranky, but my point is that I work at a company that heavily uses AI to augment real, serious professionals' capabilities. I'm not an anti-AI absolutist or anything.

I just think that it's a disaster waiting to happen for the software development industry, is all.

Here's my core thesis:

LLMs will never be as capable as we have been told they will be. All of the ginned-up demos we've been shown of an LLM-based tool being given a three-sentence description of a feature, then building it, writing the tests, reviewing itself, and deploying it, all automatically? They don't work now, and they never will. LLMs can never "know" or "understand" anything. They seem to have interesting emergent properties, but every new announcement about those capabilities quickly falls to pieces in the Hacker News comments, as curious devs give it a try and find that it actually doesn't work quite that well, at least not under all circumstances, and critically, not well enough to achieve the overblown promises we're being sold around feature-scale AI code generation. Hell, this just happened again with the release of OpenAI's o1 series of models, models that we are told "think" but in actuality do nothing of the sort. They mostly just go "hmm" and "ah!" a lot, and then over-charge you via a new secret token mechanism to return dangerously close but not-quite-right code just frequently enough to cause serious problems down the line.

It would appear that we're hitting the upper limits of practical model sizes, too. Ever-larger models have been the source of the new and interesting capabilities, the "secret sauce" of the AI boom, as it were. Larger models can still be trained, but the cost of training them (and retraining them when necessary), plus the cost of then running them, looks to be prohibitive. We can see this in the latest ChatGPT models, the 4o series, which are smaller and less expensive than the original ChatGPT4. Even o1 doesn't appear to be a significantly larger model; it's likely just a pretty solid implementation of "chain of thought" prompting with existing models. ChatGPT5 it ain't, but maybe, when and if we get ChatGPT5, it'll be another significant jump, like 2 to 3 was, or at least 3 to 4. But the basic architecture will almost certainly still be the same LLM architecture. We were in an "AI winter" before someone discovered that Big LLM Make AI Go Brr, and we'll be right back in another one once we hit the practical maximum sizes of LLMs, if we haven't already.
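
For the unfamiliar, "chain of thought" prompting is nothing exotic: you just instruct an existing model to narrate intermediate steps before committing to an answer. Here's a minimal sketch of the idea in Python, assuming the openai client library; the model name and the little wrapper function are purely illustrative, and this is emphatically not a claim about how o1 works internally.

```python
# Minimal sketch of "chain of thought" prompting with an existing model.
# Assumes the openai Python client (pip install openai) and an API key in
# the OPENAI_API_KEY environment variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

def chain_of_thought(question: str) -> str:
    # The whole "trick": ask the model to spell out intermediate steps
    # before stating a final answer, then return the full transcript.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Reason through the problem step by step. "
                           "Show your working, then state a final answer.",
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(chain_of_thought(
        "How many weekdays are in a 31-day month that starts on a Saturday?"
    ))
```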

Again, I'm not saying that LLMs are fake and that there's a man behind the curtain (except when there absolutely is, although that's not the topic of this essay), or that there's no use for LLMs. There are a variety of uses that make very reasonable sense, even in the field of coding! But using them to generate code is Not It. Absolutely no large-scale feature-generation tool has survived contact with the general public just trying to use the service as described, and I think even the small-scale code generation tools are dangerous in ways we're not really talking about yet. These are not tools that reward time invested in learning them, because they do not turn out the same quality of work. Hell, it's arguable that they don't even turn out work of acceptable quality (and that is the side I would, in fact, argue). LLM-generated code of any real-world size and complexity -- you know, not just the toy examples used to demo the product to credulous CTOs and slavering CFOs -- is almost invariably of miserable quality, lacking the refinement and robustness that come from a proper understanding of the problem area and the surrounding system, for exactly one critical reason: LLMs do not understand.

There are people who will debate me on this, and they tend to do so with arguments that point to what seem like surprising emergent properties, but these people are like audience members in a magician's act who point at the wriggling feet of the woman who was sawn in half and deny what their logical mind knows to be true -- this is all a trick. The fundamental method used to create an LLM -- that is, predictive text based on input weights -- is not something that can generate understanding; it can only seem to understand, in certain limited circumstances, for a little while. Every new generation of LLM release has gone through the same cycle of marks crying "Look, finally it understands!" only for actual users to immediately run up against the limits of this supposed "understanding." Granted, these new models are more impressive, and there are more and more cases where their careful, considered use looks reasonable, but none have spontaneously developed (nor ever will develop) the ability to "understand," much like how an apple has never spontaneously developed the ability to harmonize, despite being able to make your tastebuds sing.

But it's worse than that.

It's not just that people are putting time and effort into learning a tool that won't pay off in the manner the hype cycle has promised. Unlike the other tools that have come before it promising improved productivity, this one actually makes you less effective over time. You may get better at "prompt engineering" (UGH), but you are sacrificing your knowledge and understanding of the system and tooling that you're working with, like trading a worthwhile, solid cow for magic goddamn beans, except that outside of fairy tales, beans are only ever beans.

Think about a custom furniture maker who takes down the project details, then takes that information to Ikea, asks the salesperson for something similar, bangs it together in an afternoon, and delivers that instead. Some people are going to be thrilled that they have their furniture that very same day, sure. They may not even notice that the furniture is painted particle board and not a decent slab of hardwood, because they didn't think to specify that in the details. But they will for sure notice, the third or fourth time they have to move it, that it starts coming apart and needs more and more intense repairs to keep it in one piece. And when this furniture maker does finally get a project that specifies aged mahogany, mitred dovetail joints, and hand-carved engravings, they just do not have the skills to build it. This is the future we're headed towards.

There are two important ways to evaluate these reduced capacities: in the near term, i.e. what this means for the projects we're working on currently, and in the long term, i.e. what this means for the greater industry.

The near term is bad enough. The obvious and immediate drawback is to code quality, as AI-generated slop starts sneaking into codebases. Tight timelines and aggressive deadlines (which are more and more frequent everywhere, have you noticed?) mean that we've had to rely more and more on trust that our co-workers are doing the good job they say they are. I love when I get some time to do some truly focused review on submitted PRs, but all too frequently it's hard to justify more than a merely "decent" evaluation, plus maybe a closer look at a few key areas, before you have to stamp it with the ol' LGTM and get back to trying to hit the deadlines on your own tasks. (This is a problem in its own right, but again, not the subject of this essay.) This process of leaning on trust makes it incredibly easy for bad LLM code to slip in, because the number one thing LLMs are good at is generating text that looks like it belongs. Again, they don't understand, so they can't be relied on to write correct code, but they can be relied on to write code that looks correct, so these time-restricted, trust-reliant, scan-based reviews are tremendously vulnerable. And even when you do have time to do a more thorough review, you're not perfect and never will be. In security, this is why you rely on "defence in depth," which in coding really ought to start with "the coder submitting the PR believes they thoroughly understand and stand behind the code they are submitting!"
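
To make that concrete, here's a hypothetical example of the kind of thing I mean (the function and its name are invented for illustration, not pulled from any particular assistant): a little batching helper that reads cleanly, passes the happy-path test, and would sail straight through a scan-based review.

```python
# Hypothetical example: a helper that *looks* correct at a glance.
# It reads cleanly and handles the obvious case -- but it silently drops
# the final partial batch whenever len(items) isn't a multiple of batch_size.
def batch_items(items, batch_size=100):
    batches = []
    for i in range(len(items) // batch_size):  # bug: floor division ignores the remainder
        batches.append(items[i * batch_size:(i + 1) * batch_size])
    return batches

# The correct version is barely longer -- but you only write it (or catch
# the bug in review) if you actually understand the edge case:
def batch_items_fixed(items, batch_size=100):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```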

I'm struggling at this point not to write "but it's worse than that" over and over and over again, but, well, it's worse than that. Not only is this extra source of bad code increasingly finding its way into our systems, but the developers who should be writing and reviewing and revising this code are just not learning the skills required to do so. Moreover, they're not learning about the intricacies and pain points of the system. They just won't know which parts of the system are hard to reason about, which are more likely to introduce bugs when worked on, which can be expanded on safely and quickly, and which require a refactor before any serious work can be done with them -- they just won't get the experience to develop a low-level understanding of the system that helps them reliably and quickly continue to build on it. They're farming that work out to co-workers who do have that understanding, who leave ever-more-frustrated comments on reviews asking them to please consider the broader context of the work they're doing, and then they pipe that into their prompt, asking their LLM to "Please revise this PR taking Maria's comments into account, then commit, then push." And poor Maria has to repeat this cycle of indirectly prompting an LLM until, between the three of them, they get lucky and the problem is finally solved, all while trying to get her own work done as well.

Beyond just the systems being worked on at any one company, though, there are the effects on the wider industry - junior developers who become dependent on LLMs can seem like incredibly fast learners and can present as intermediate developers instead. This doesn't truly reflect their skills - they may seem faster than juniors and intermediates who are learning for and by themselves, but beyond that, I highly doubt an LLM-dependent developer can even exist at the senior level, and especially not any higher.

Note: I'm not saying you can't use LLMs and be considered a senior developer; I'm talking about dependence. I.e., do you yourself know how to program, do you have an intuitive grasp of how various common algorithms and data structures work, can you personally make logical leaps when debugging thorny issues? Or are you relying solely on LLMs for this competency? If the latter, I would absolutely not call you a senior developer, and I wouldn't let you within rock-throwin' distance of a keyboard that could even conceivably commit to any repositories I cared about.

Obviously there's a limit to how much of this bathwater I'm going to throw out - there's a baby in there somewhere! For instance, given how terrible Google's search has become recently, asking an LLM to provide you with a decent introduction to a hard-to-research topic is not a bad place to start gaining familiarity! (Provided, of course, you fact-check what it tells you.) As developers, we use all kinds of tools and technology to make our lives easier and to make writing code less onerous. Intellisense is a godsend for people like myself with memory issues (I can remember the "vibe" of a function, but I hate wasting time remembering whether it's named search_user_text() or search_text_user(), or what the parameter order is, etc.), and while writing MIPS assembly can be a lot of fun, I wouldn't want to write assembly code for my day job -- I'm thankful every day for high-level languages.

But the difference is that those tools tend to enhance our understanding of the problem domain while encapsulating things that are outside of that domain. You don't need to know the Linux kernel's C source like the back of your hand to get your job done (unless your job is working on the Linux kernel, I suppose), but tools that make it easier to represent and understand your business domain are critical to a modern developer. LLMs are the opposite of that - they tend to generate code that looks like their training data, not necessarily your codebase, and they remove the need for you to understand your tools or domain at all! ...at least for a surface-level approximation of work.

That famous Steve Jobs saying, "A players hire A players, B players hire C players?" Well, it may have been heartless, but there's a logic to it. That said, I think it needs updating for the modern era: "A players hire A players, B players hire C players who hire AI chatbots and pretty soon nobody knows how the fuck anything works anymore."
