@elebertus
I've read so much LLM code at this point. There are still patterns that are present but elude my understanding, but one thing that's clear is that there are foundational flaw categories that do not improve with model version and that appear in wildly different projects using wildly different models and harnesses. Testing is a big nexus of those flaws. I am not close to a satisfying explanation of the dynamics, but every project suffers fucked testing problems.
I suspect testing has the same properties as translation. It’s moderately easy to build machine-translation systems that are kind-of okay. A mechanical dictionary is a reasonable approximation. If something goes through your post, looks up each word in an English-French dictionary (for example) and outputs the resulting text, it won’t be correct, but it will be vaguely comprehensible. If you build a dictionary of bigrams or trigrams (sequences of 2-3 words) this gets a bit better because now collocations are more likely to be translated correctly. It won’t be as good as a professional translator, but it will more or less look like the target language. Add more statistical modelling and you will get better up to a point. But there’s a cliff where you can’t improve without actually understanding the content. No amount of statistical modelling will let you accurately translate the things that are statistical outliers, and because some of the knowledge you need is extrinsic to the text, you can’t infer a correct translation from the text alone without understanding its context.
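To make the dictionary idea concrete, here is a minimal sketch of word-by-word lookup with an optional bigram table. The tiny English-French tables and the `translate` function are made up for illustration; a real system would be far larger and this is not how any particular translator works.

```python
# Toy English-to-French tables; the entries are illustrative assumptions.
UNIGRAMS = {"i": "je", "want": "veux", "a": "un", "hot": "chaud", "dog": "chien"}
BIGRAMS = {("hot", "dog"): "hot-dog"}  # a collocation that word-by-word lookup gets wrong


def translate(text: str, use_bigrams: bool = True) -> str:
    """Translate by table lookup, preferring a two-word match when one exists."""
    words = text.lower().split()
    out = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if use_bigrams and pair in BIGRAMS:
            out.append(BIGRAMS[pair])
            i += 2
        else:
            # Unknown words pass through untranslated.
            out.append(UNIGRAMS.get(words[i], words[i]))
            i += 1
    return " ".join(out)


print(translate("I want a hot dog", use_bigrams=False))  # je veux un chaud chien
print(translate("I want a hot dog"))                     # je veux un hot-dog
```

The bigram table fixes the collocation, but nothing in either table can tell you whether "hot dog" in a given sentence means the food or an overheated pet. That is the cliff.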
Tests have a similar property. Good tests convey the intention, but the intention is not part of the code and so can’t be inferred from it. Good tests cover the things that the test author knows are corner cases, but these can’t be inferred from the code either (a few can, if the language has explicit error-handling constructs) because they’re a property of the input data.
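A small sketch of the difference, using a hypothetical `parse_port` function that I made up for this example. The first test can be derived from the code alone; the second needs someone who knows what a valid port is and which inputs are likely to go wrong.

```python
def parse_port(text: str) -> int:
    """Hypothetical function under test: parse a TCP port number."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def rejects(text: str) -> bool:
    """Return True if parse_port refuses the input with ValueError."""
    try:
        parse_port(text)
    except ValueError:
        return True
    return False


def test_mirrors_the_code():
    # Restates the implementation: passes as long as the code does what the code does.
    assert parse_port("80") == int("80")


def test_states_the_intent():
    # The boundaries come from knowing the domain, not from reading the code.
    assert parse_port("1") == 1
    assert parse_port("65535") == 65535
    for bad in ["0", "65536", "-1", "http", ""]:
        assert rejects(bad), bad


test_mirrors_the_code()
test_states_the_intent()
```

If someone later "simplifies" the range check to `port < 65536`, the first test still passes and the second fails, which is the whole point.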
In both cases, LLMs try to compensate for the lack of understanding by having a lot of examples of similar things in their input. If the thing you’re translating is similar to a load of other things, you may not need to understand it to translate it correctly because the first dozen (or hundred, thousand, or whatever scale you need) people to translate something like that did the hard work and you can reuse it. If the thing you’re testing is similar to a load of other things that already exist, someone else may have done the hard work of identifying the common failure modes and expressing intent.
But LLM-generated tests commonly end up testing that the code does what the code does. And that’s not useful. If you want that, just use fuzzing in a harness that tests trace equivalence between two versions of the program (for the same sequence of inputs, do they generate the same output?). That is useful for no-functionality-change-intended patches (typically things that improve performance or simplify unnecessary complexity), but most changes to the codebase are there because you want the behaviour to change. Good tests will fail if you changed something that was part of an API contract but will not fail if you added new behaviour, whereas tests derived from the code have to change whenever the code does.
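For the no-functionality-change case, the harness can be very small. This is a rough sketch for a pure function, with made-up `old_sum_squares` and `new_sum_squares` standing in for the two versions; a real trace-equivalence setup would compare whole sequences of inputs and outputs for a stateful program and would use a proper fuzzer rather than `random`.

```python
import random


def old_sum_squares(xs):
    """Original version (hypothetical)."""
    total = 0
    for x in xs:
        total += x * x
    return total


def new_sum_squares(xs):
    """Refactored version that is intended to behave identically."""
    return sum(x * x for x in xs)


def find_divergence(old, new, trials=1000, seed=0):
    """Feed both versions the same random inputs; return the first input
    on which their outputs differ, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        # Pass copies so neither version can affect the other's input.
        if old(list(xs)) != new(list(xs)):
            return xs
    return None


print(find_divergence(old_sum_squares, new_sum_squares))
```

Note that this only tells you the two versions agree on the inputs tried. It says nothing about whether either of them is right.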
This isn’t limited to LLMs. Some of the LLVM tests are just ‘run this command, the output should look like this’. People typically reject these in review now because long and painful experience showed us that it was hard to refactor when a change broke a test and the change author couldn’t tell if the difference in output came from something we actually cared about or just something that happened to be part of the old version’s output. But humans can, at least, tell the difference in the tests because they understand what it is that they intend with the change that introduces the test.
Don't worry if your wife loses her job to AI. Start worrying if she starts carving the words "no fate" in tables.