@elebertus
I've read so much LLM code at this point. There are still patterns that are present but elude my understanding, but one thing that's clear is that there are foundational flaw categories that are not improved upon by model version and appear in wildly different projects using wildly different models and harnesses. Testing is a big nexus of those flaws. I am not close to what would be a satisfying explanation of the dynamics, but every project suffers from fucked testing problems.
I suspect testing has the same properties as translation. It’s moderately easy to build machine-translation systems that are kind-of okay. A mechanical dictionary is a reasonable approximation. If something goes through your post, looks up each word in an English-French dictionary (for example) and outputs the resulting text, it won’t be correct, but it will be vaguely comprehensible. If you build a dictionary of bigrams or trigrams (sequences of 2-3 words) this gets a bit better because now collocations are more likely to be translated correctly. It won’t be as good as a professional translator, but it will more or less look like the target language. Add more statistical modelling and you will get better up to a point. But there’s a cliff where you can’t improve without actually understanding the content. No amount of statistical modelling will let you accurately translate the things that are statistical outliers and the extrinsic knowledge necessary means that you can’t infer a correct translation from the text alone without understanding its context.
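A rough sketch of the mechanical-dictionary idea, assuming a tiny made-up English-French word table and a one-entry bigram table (both hypothetical, purely for illustration). The bigram lookup fixes the "ice cream" collocation, but the article still comes out with the wrong gender, which is the kind of thing that needs more than lookup:

```python
# Toy English->French tables; the entries are illustrative assumptions only.
WORDS = {"the": "le", "ice": "glace", "cream": "crème", "melts": "fond"}
BIGRAMS = {("ice", "cream"): "glace"}


def translate_words(text):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(WORDS.get(w, w) for w in text.lower().split())


def translate_bigrams(text):
    """Prefer a two-word match, fall back to single-word lookup."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAMS:
            out.append(BIGRAMS[pair])
            i += 2
        else:
            out.append(WORDS.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)


print(translate_words("The ice cream melts"))    # le glace crème fond
print(translate_bigrams("The ice cream melts"))  # le glace fond
```

Neither output is correct French ("la glace fond" would be), but the second at least looks like the target language.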
Tests have a similar property. Good tests convey the intention, but the intention is not part of the code and so can’t be inferred from it. Good tests cover the things that the test author knows are corner cases, but these can’t be inferred from the code either (a few can, if the language has explicit error-handling constructs) because they’re a property of the input data.
In both cases, LLMs try to compensate for the lack of understanding by having a lot of examples of similar things in their input. If the thing you’re translating is similar to a load of other things, you may not need to understand it to translate it correctly because the first dozen (or hundred, thousand, or whatever scale you need) people to translate something like that did the hard work and you can reuse it. If the thing you’re testing is similar to a load of other things that already exist, someone else may have done the hard work of identifying the common failure modes and expressing intent.
But LLM-generated tests commonly end up testing that the code does what the code does, and that’s not useful. If you want that, just use fuzzing in a harness that tests trace equivalence between two versions of the program (for the same sequence of inputs, do they generate the same output?). That is useful for no-functionality-change-intended patches (typically things that improve performance or simplify unnecessary complexity), but most changes to the codebase are there because you want the behaviour to change. Good tests will fail if you changed something that was part of an API contract but will not fail if you added new behaviour, whereas tests derived from the code have to change every time the code does.
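For what the trace-equivalence harness might look like, here is a minimal sketch. The two implementations (`old_impl`, `new_impl`) and the helper `find_divergence` are hypothetical stand-ins, and a real fuzzer would generate inputs far more cleverly than `random` does:

```python
import random


def old_impl(xs):
    # Hypothetical "before" version: quadratic order-preserving de-duplication.
    out = []
    for x in xs:
        if x not in out:
            out.append(x)
    return out


def new_impl(xs):
    # Hypothetical "after" version: same behaviour intended, just faster.
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def find_divergence(f, g, trials=1000, seed=0):
    """Feed both versions the same random inputs; return the first input
    on which their outputs differ, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        if f(list(xs)) != g(list(xs)):
            return xs
    return None


print(find_divergence(old_impl, new_impl))  # None
```

This only says the two versions agree on the inputs tried. It says nothing about whether either of them does what anyone wanted, which is why it suits no-functionality-change patches and little else.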
This isn’t limited to LLMs. Some of the LLVM tests are just ‘run this command, the output should look like this’. People typically reject these in review now because long and painful experience showed us that it was hard to refactor when a change broke a test and the change author couldn’t tell if the difference in output came from something we actually cared about or just something that happened to be part of the old version’s output. But humans can, at least, tell the difference in the tests because they understand what it is that they intend with the change that introduces the test.
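To make the contrast concrete, a small sketch with a made-up `render_report` function (not from LLVM or any real project). The first test pins the entire output, the second only states the thing the author cares about:

```python
def render_report(items):
    # Hypothetical function under test: formats a small report.
    lines = ["Report v1", "===="]
    for name, count in sorted(items.items()):
        lines.append(f"{name}: {count}")
    lines.append(f"total: {sum(items.values())}")
    return "\n".join(lines)


def test_golden_output():
    # Brittle: breaks if the header, separator or ordering changes,
    # and the failure doesn't say which of those anyone cared about.
    expected = "Report v1\n====\na: 1\nb: 2\ntotal: 3"
    assert render_report({"b": 2, "a": 1}) == expected


def test_total_is_sum_of_counts():
    # States the intent: the report must include the correct total.
    out = render_report({"b": 2, "a": 1})
    assert "total: 3" in out.splitlines()


test_golden_output()
test_total_is_sum_of_counts()
```

Both pass today. Rename the header to "Report v2" and only the first one fails, with no hint as to whether the header was ever part of the contract.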
Replies
Don't worry if your wife loses her job to AI. Start worrying if she starts carving the words "no fate" in tables.
@sun
@david_chisnall
That sucks, sorry that's happening to y'all. Of course whether or not something actually performs the same task or does it as well as a professional translator would is relatively orthogonal to whether the bosses will use that thing to increase profit by laying off workers and papering over failures.
@sun @david_chisnall definitely agreed on that, i am a filthy monolingual and so am the last person who can speak on translation, but as david is saying above it definitely can get to the point where it does the trick most of the time, up to a point when the communicative context escapes the space covered by the training data. that probably includes most of "business-relevant communication" that makes up most of the moneymaking end, but also i can imagine easily leaves people in e.g. complex legal situations navigating asylum or immigration proceedings in a really bad place, particularly when the very non-neutral language surface of the LLM starts to rear its head. like at once it is a miracle that i can get "pretty good" translation on demand any time, and not to be underrated, but as you say some things will be worse not better, and unfortunately most of that "worst" is going to fall on people who already have it bad.
Content warning
how'd this get so long it was supposed to be a quick thought, waiting for tests to run
@sun i figure the next stage and longer tail of the commercial end after the outrageous debt and speculation in the consumer market pops is latching onto workflow data to provide company- or domain-specific finetunes or strapon autoencoders. i'm sort of surprised on a longer term scale that isn't already the case, that the foundation models can be 'good enough', the economics restrictive enough where the ai giants are not licensing weights, the ui problem where people gravitate towards their favorite box, etc. haven't made that more of a thing yet. it's all still selling prefilled context windows at best, and all those businesses are doing terribly. i do wonder if the consumer market has been too poisoned by the chatbot modality to be able to adopt the 'whole life surveillance/whole life product surface' shift that google and microsoft have been trying to push, and these kinds of economically viable domain-specific products will always just look like Hated Work SaaS App.
if you stretch the boundaries of what you consider "AI" to include an assemblage of purpose-built models and algorithms, then at the outer limits there you start to get back to "normal computing." the language | program interface is still, as far as i can see, an intractable one in the general sense, where the leap from text generation to tool calling can land you somewhere but it can also yeet you off into space. i would say on the latest anthropic and openai models i get about a 50/50 chance of whether the thing is capable of spawning a subagent or whether it fails and makes excuses about it (and sometimes even simulates the output of a subagent but in the debug logs you can see it failed the tool call). That's a non-negotiable barrier that doesn't have an easy technological solution in sight that prevents the wildest subdomain takeover scenarios, but in the meantime a lot of stuff sort of works and for most people that seems to be alright.
but yeah if one gets past all the sentient god-machine occultism, it's not hard to imagine this generation of tech giants going the same way as the prior (current?) gen, spinning off their magic product into a thousand derivatives that capture a lot of cash, displace a lot of labor, provide some useful services as a byproduct, etc.
most of my serious reservations come from the economic and epistemic violence of it all, and i definitely do active research into the failure modes, but of course they do some things, and as you say it's important to stay apprised of the real capabilities both to calibrate criticism and appreciate what can be done.
Content warning
re: how'd this get so long it was supposed to be a quick thought, waiting for tests to run
@sun i think it would be a real shame if the outcome of the alienation tech was to make us all self-alienate and hate each other. i find most binaries exhausting and unenlightening. like even with monster trucks, i can acknowledge how they're unbelievably wasteful and often enmeshed in some pretty hateful shit, while still being able to go like "whoa holy shit" when they are doing wheelies and flips and whatnot.