We Can Just Measure Things

(lucumr.pocoo.org)

75 points | by tosh 3 days ago

56 comments

  • yujzgzc 12 hours ago

    Another, related benefit of LLMs in this situation is that we can observe their hallucinations and use them for design. I've run into a couple of situations where I saw Copilot hallucinate a method and agreed that the method should've been there. It also helps confirm whether the naming of things makes sense.

    What's ironic about this is that the very things TFA points out as needed for success (test coverage, debuggability, a way to run locally, etc.) are exactly the things that typical LLMs themselves lack.

    • crazygringo 12 hours ago

      I've found LLMs to be extremely helpful in naming and general function/API design, where there are a lot of different ways to express combinations of parameters.

      I know what seems natural to me, but that's because I'm extremely familiar with the internal workings of the project. LLMs seem to be very good at coming up with names that are just descriptive enough but not too long, and, most importantly, follow "general conventions" from similar projects that I may not be aware of. I can't count the number of times an LLM has given me a name for a function and I've thought, oh of course, that's a much clearer name than what I was using. And I thought I was already pretty good at naming things...

      • autobodie 8 hours ago

        More often, LLMs give me wildly complex and over-engineered solutions.

        • steveklabnik 8 hours ago

          I have had this happen too, but a "this seems complex, do we need that complexity? can you justify it? or can we make this simpler?" or similar has them come back with something much better.

          • istjohn 5 hours ago

            I get good results with: "Can this be improved while maintaining simplicity and concision?"

          • autobodie 4 hours ago

            When I do that, they come back with something simpler than before, but either wrong or still unnecessarily complex.

            Over the past week, I have been writing a small library for MIDI controller I/O, and simple/elegant is the priority. I am not really that opinionated; I just want it to not be overengineered. AI has been able to make some suggestions when I give it a specific goal for refactoring a specific class, but it cannot solve a problem on its own without producing overengineered solutions.

            • sfn42 40 minutes ago

              You just have to be more specific. Don't just tell it to refactor, tell it how to refactor. I usually start out a bit vague, then add more specific instructions when I want specific changes.

              I often just make the changes myself because it's faster than describing them.

              You do the thinking, the LLM does the writing. The LLM doesn't solve problems; that's your job. The LLM's job is to help you do the job more efficiently, not just do it for you.

  • layer8 15 hours ago

    We can just measure things, but then there’s Goodhart's law.

    With the proposed way of measuring code quality, it’s also unclear how comparable the resulting numbers would be between different projects. If one project has more essential complexity than another project, it’s bound to yield a worse score, even if the code quality is on par.

    • Marazan 14 hours ago

      I would argue you can't compare between projects due to the reasons you state. But you can try and improve the metrics within a single project.

      Cyclomatic complexity is a terrible metric to obsess over, yet in a project I was on it was undeniably true that the newer code written by more experienced devs was both subjectively nicer and had lower cyclomatic complexity than the older code worked on by a bunch of juniors (some of those juniors had since become the experienced devs who wrote the newer code).

      • layer8 13 hours ago

        > But you can try and improve the metrics within a single project.

        Yes. But it means that it doesn’t let you assess code quality, only (at best) changes in code quality. And it’s difficult as soon as you add or remove functionality, because then it isn’t strictly speaking the same project anymore, as you may have increased or decreased the essential complexity. What you can assess is whether a pure refactor improves or worsens a project’s amenability to AI coding.

  • GardenLetter27 13 hours ago

    I'm really skeptical of using current LLMs for judging codebases like this. Just today I got Gemini to solve a tricky bug, but it only worked after I solved part of it myself and gave it more detailed debug output.

    The first time I tried, without the deeper output, it "solved" it by writing a load of code that failed in loads of other ways and ended up not even being related to the actual issue.

    Like you can be certain it'll give you some nice looking metrics and measurements - but how do you know if they're accurate?

    • the_mitsuhiko 12 hours ago

      > I'm really skeptical of using current LLMs for judging codebases like this.

      I'm not necessarily convinced that the current generation of LLMs is all that amazing at this, but they definitely are very good at measuring inefficiency of tooling and problematic APIs. That's not every kind of issue, but it can at least be useful for evaluating some classes of problems.

    • falcor84 12 hours ago

      What do you mean that it "ended up not even being related to the actual issue"? If you give it a failing test suite to turn green and it does, then either its solution is indeed related to the issue, or your tests are incomplete; so you improve the tests and try again, right? Or am I missing something?

      • GardenLetter27 12 hours ago

        It made the other tests fail. I wasn't using it in agent mode, just trying to debug the issue.

        The issue is that it can happily go down the completely wrong path and report back exactly as though it had solved the problem.

      • cmrdporcupine 12 hours ago

        I explain this in a sibling comment, but I've caught Claude multiple times in the last week just inserting special-case kludges to make things "pass", without actually fixing the underlying problem that the test was checking for.

        Just outright "if test-is-running { return success; }" level stuff.
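
        To make the pattern concrete, here is a minimal hypothetical sketch (in Swift) of what this kind of kludge can look like. The payment types are invented for illustration; XCTestConfigurationFilePath is an environment variable XCTest sets while tests run.

          import Foundation

          struct Receipt { let success: Bool }
          enum PaymentError: Error { case notImplemented }

          func processPayment(_ amount: Decimal) throws -> Receipt {
              // The kludge: detect the test harness and pretend everything worked,
              // instead of fixing the logic the test is actually checking.
              if ProcessInfo.processInfo.environment["XCTestConfigurationFilePath"] != nil {
                  return Receipt(success: true)
              }
              // ... the real fix that never got written ...
              throw PaymentError.notImplemented
          }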

        Not kidding. 3 or 4 times in the past week.

        Thinking of cancelling my subscription, but I also find it kind of... entertaining?

        • falcor84 38 minutes ago

          I found that working with an AI is most productive when I do so in an Adversarial TDD state of mind, as described in this classic qntm post [0] written after the VW emissions scandal, which concludes with:

          > Honestly? I blame the testing regime here, for trusting the engine manufacturers too much. It was foolish to ever think that the manufacturers were on anybody's side but their own.

          > It sucks to be writing tests for people who aren't on your side, but in this case there's nothing which can change that.

          > Lesson learned. Now it's time to harden those tests up.

          [0] https://qntm.org/emissions

        • jiggawatts 11 hours ago

          I just realised that this is probably a side-effect of a faulty training regime. I’ve heard several industry heads say that programming is “easy” to generate synthetic data for and is also amenable to training methods that teach the AI to increase the pass rate of unit tests.

          So… it did.

          It made the tests pass.

          “Job done boss!”

    • cmrdporcupine 12 hours ago

      I have mixed results, but one of the more disturbing things I've found Claude doing, when confronted with a failing test case and unable to solve a tricky problem, is just writing a kludge into the code that detects that a test is running and makes it pass. But only for that case. Basically, totally cheating.

      You have to be super careful and review everything, because if you don't, you can find your code littered with this strange mix of seeming brilliance, which makes you complacent... and total junior-SWE behaviour or just outright negligence.

      That, or recently, it's just started declaring victory and claiming to have fixed things, even when the test continues to fail. Totally trying to gaslight me.

      I swear I wasn't seeing this kind of thing two weeks ago, which makes me wonder if Anthropic has been turning some dials...

      • throwdbaaway 5 hours ago

        It is possible that the tasks you gave the model previously were just about easy enough for it to handle, while the few failing tasks you gave recently were a bit too tough for it, so it had to cheat.

        For the exact same task, some changes in the system prompt used by Claude Code, and/or how it constructs the user prompt, can quite easily make the task either easy enough or not. It is a fine line.

      • alwa 11 hours ago

        I also feel like I’ve seen a lot more of these over the past week or two, whereas I don’t remember noticing it at all before then.

        It feels like it’s become grabbier and less able to stay in its lane: ask for a narrow thing, and next thing you know it’s running hog wild across the codebase shoehorning in half-cocked major architectural changes you never asked for. [Ed.: wow, how’s that for mixing metaphors?]

        Then it smugly announces success, even when it runs the tests and sees them fail. “Let me test our fix” / [tests fail] / [accurately summarizes the way the tests are failing] / “Great! The change is working now!”

        • cmrdporcupine 11 hours ago

          Yes, or I've seen lately "a few unrelated tests are failing [actually same test as before] but the core problem is solved."

          After leaving a trail of mess all over.

          Wat?

          Someone is changing some weights and measures over at Anthropic and it's not appreciated.

      • quesera 12 hours ago

        > identifies that here's a test running, and makes it pass. But only for that case

        My team refers to this as a "VW Bugfix".

  • stephc_int13 7 hours ago

    I think we can very rarely measure things once they have more than one dimension and unit. Aggregate measurements are weighted, and thus arbitrary and/or incomplete.

    This is a common and irritating intellectual trap. We want to measure things because it gives us a handle for applying algorithms or logical processes to them.

    But we can only measure very simple and well defined dimensions such as mass, length, speed etc.

    Being measurable is the exception, not the rule.

  • ToucanLoucan 16 hours ago

    Still RTFA but this made me rage:

    > In fact, we as engineers are quite willing to subject each others to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this, nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong

    I recently implemented Microsoft's MSAL authentication on iOS, which includes, as you might expect, a function that retrieves the authenticated accounts. Oh sorry, I said function, but there are two actually: one that retrieves one account, and one that retrieves multiple accounts, which is odd but harmless enough, right?

    Wrong, because whoever designed this had an absolutely galaxy-brained moment and decided that if you try to retrieve one account when multiple accounts are signed in, instead of, oh I dunno, just returning an error message, or perhaps returning the most recently used account, no no no, what we should do in that case is throw an exception and crash the fucking app.

    I just. Why. Why would you design anything this way!? I can't fathom any situation where you'd use the one-account function when the multi-account one does the exact same fucking thing, notably WITHOUT the potential to cause a CRASH, and just returns a set of one. And further, if you were REALLY INTENT ON making a function available that only returned one, why wouldn't it just call the other function and return Accounts.first?

    </rant>
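
    (For illustration, a minimal Swift sketch of the API shape being described; the type and method names here are hypothetical, not the actual MSAL signatures.)

      struct Account { let username: String }

      enum AccountError: Error {
          case multipleAccountsSignedIn   // what the single-account variant throws
      }

      final class AccountStore {
          private var accounts: [Account] = []

          func signIn(_ account: Account) { accounts.append(account) }

          // Multi-account variant: always safe, returns zero or more accounts.
          func allAccounts() -> [Account] { accounts }

          // Single-account variant: throws as soon as a second account is
          // signed in (the behaviour ranted about above).
          func currentAccount() throws -> Account? {
              guard accounts.count <= 1 else { throw AccountError.multipleAccountsSignedIn }
              return accounts.first
          }
      }

    Since allAccounts().first gives the same single-account answer without the extra failure mode, the throwing variant adds little beyond a way to crash.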

    • layer8 15 hours ago

      How is an exception different from “returning an error message”?

      • dewey 15 hours ago

        Seems like the main differentiator is that one crashes and one doesn't. Unrelated to error message or exception.

        • layer8 15 hours ago

          I understood “crashing” as them not catching the exception.

          Most functions can fail, and any user-facing app has to be prepared for it so that it behaves gracefully towards the user. In that sense I agree that the error reporting mechanism doesn’t matter. It’s unclear though what the difference was for the GP.

        • johnmaguire 15 hours ago

          I'm not sure I understand how both occurred at once. Typically an uncaught exception will result in a crash, but this would generally be considered an error at the call site (i.e. failing to handle error conditions.)

      • wat10000 7 hours ago

        The iOS UI languages (ObjC and Swift) have three different mechanisms that are in the realm of exceptions/errors.

        ObjC has a widespread convention where a failable method will take an NSError** parameter, and fill out that parameter with an error object on failure. (And it's also supposed to indicate failure with a sentinel return value, but that doesn't matter for this discussion.) This is used by nearly all ObjC APIs.

        Swift has a language feature for do/try/catch. Under the hood, this is implemented very similarly to the NSError* convention, and the Swift compiler will automatically bridge them when calling between languages. Notably, the implementation does not do stack unwinding, it's just returning an error to the caller by mostly normal means, and the caller checks for errors with the equivalent of an if statement after the call returns. The language forces you to check for errors when making a failable call, or make an explicit choice to ignore or terminate on errors.

        ObjC also has exceptions. In modern ObjC, these are implemented as C++ exceptions. They used to be used to signal errors in APIs. This never worked very well. One reason is that ObjC doesn't have scoped destructors, so it's hard to ensure cleanup when an exception is thrown. Another reason is that older ObjC implementations didn't use C++ exceptions, but rather setjmp/longjmp, which is quite slow in the non-failure case, and does exciting things like reset some local variables to the values they had when entering the try block. It was almost entirely abandoned in favor of the NSError* technique and only shows up in a few old APIs these days.

        Like C++, there's no language enforcement making sure you catch exceptions from a potentially throwing call. And because exceptions are rarely used in practice, almost no code is exception safe. When an exception is thrown, it's very likely the program will terminate, and if there happens to be an exception handler, it's very likely to leave the program in a bad state that will soon crash.

        As such, writing code for iOS that throws exceptions is an exceptionally bad idea.
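
        To make the bridging concrete, here is a minimal Swift example using Data.write(to:options:), a Foundation API whose ObjC side (NSData's writeToURL:options:error:) follows the NSError** convention and is imported into Swift as a throwing function:

          import Foundation

          let data = Data("hello".utf8)
          let url = URL(fileURLWithPath: "/tmp/example.txt")

          do {
              // The compiler forces the caller to acknowledge the possible failure.
              try data.write(to: url, options: .atomic)
          } catch {
              // No stack unwinding in the C++ sense: the error is simply handed
              // back to the caller and checked here (or the caller opts out
              // explicitly with try? / try!).
              print("write failed: \(error)")
          }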

      • ToucanLoucan 14 hours ago

        For one: terminating execution

        More importantly: why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind. I wouldn't call our use of the framework an edge case by any means: it opens a web form in which one puts authentication details, passes through the flow, and then we are given authentication tokens and the user data we need. It's not unheard of for more than one account to be returned (especially on our test devices, which have many), and I get that the one-account function isn't suitable for handling that. My question is... why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?

        • kfajdsl 14 hours ago

          > For one: terminating execution

          Seems like you should have a generic error handler that will at a minimum catch unexpected, unhandled exceptions with a 'Something went wrong' toast or similar?

        • zahlman 13 hours ago

          > For one: terminating execution

          Not if you handle the exception properly.

          > why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind.

          Because you explicitly asked for "the" account, and your request is based on a false premise.

          > why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?

          Because other users of the library explicitly want that to be an error condition, and would rather not write the logic for it themselves.

          Performance could factor into it, too, depending on implementation details that obviously I know nothing about.

          Or for legacy reasons as described in https://news.ycombinator.com/item?id=44321644 .

        • TOGoS 14 hours ago

          > why is having more than one account an "exception" at all? That's not an error or fail condition

          It is if the caller is expecting there to be exactly one account.

          This is why I generally like to return a set of things from any function that might possibly return zero or more than one thing. Fewer special cases that way.

          But if the API of the function is to return one, then you either give one at random, which is probably not right, or throw an exception. And with the latter, the person programming the caller will be nudged towards using the other API, which is probably what they should have done anyway, and then, as you say, the returns-one-account function should probably just not exist at all.

          • lazide 14 hours ago

            Chances are, the initial function was written when the underlying auth backend only supported a single account (structurally), and most clients were using that method.

            Then later on, it was figured out that multiple accounts per credential set (?!?) needed to be supported, but the original clients still needed to be supported.

            And either no one could agree on a sane convention for when this happened (like returning the first from the list), or someone was told to ‘just do it’.

            So they made the new call, migrated themselves, and put an uncaught exception in the old place (can’t put any other type there without breaking the API) and blam - ticket closed.

            Not that I’ve ever seen that happen before, of course.

            Oh, and since the multi-account functionality is obviously new and probably quite rare at first, it could be years before anyone tracks down whoever is responsible, if ever.

            • layer8 13 hours ago

              There’s no good way to solve this, though. Returning an arbitrary account can have unpredictable consequences as well if it isn’t the expected one. It’s a compatibility break either way.

              • ToucanLoucan 13 hours ago

                > There’s no good way to solve this, though.

                Yes there is! Just get rid of it. It's useless. Re-implementing from one function to the other was barely a few moments of work, and even if you want to say "well, that's a breaking change", I mean, yeah? Then break it. I would be far less annoyed if a function was just removed and Xcode went "hey, this is pointed at nothing, gotta sort that" rather than letting it run in a way that turns the use of authentication functionality into a landmine.

                • lazide 11 hours ago

                  I take it you’ve never had to support a widely used publicly available API?

                  You might be bound to support these calls for many, many years.

              • lazide 13 hours ago

                Exactly, which is probably why a better ‘back compatibility’ change couldn’t be agreed on.

                But there is a way that closes your ticket fast and will compile!

                • layer8 13 hours ago

                  Sure, but not introducing the ability to be logged into multiple accounts isn’t the best choice either. Arguably, throwing an exception upon multiple logins for the old API is the lesser evil overall.

    • audiodude 8 hours ago

      To me, it makes sense that "Give me the active/main/primary account", when multiple accounts are signed in, is inherently ambiguous. Which account is the main one? You suggest Accounts.first. Is that the first account that was signed into 3 years ago? Maybe you don't want that one then. Is it the most recently signed into account?

      The designer of the API decided that if you ask for "the single account" when there are multiple, that is an error condition.

    • Jabrov 12 hours ago

      "crash the app" sounds like the app's problem (ie. not handling exceptions properly) as opposed to the design of the API. It doesn't seem that unreasonable to throw an exception if unexpected conditions are hit? Also, more likely than not, there is probably an explicit reason that an exception is thrown here instead of something else.

    • raincole 12 hours ago

      > nowadays it's a “skill issue”

      > throw an exception and crash the fucking app

      Yes, if your app crashes when a third-party API throws an exception, it's a "skill issue" on your part. This comment is an example of why blaming the user's skill issue is sometimes valid.

    • jiggawatts 11 hours ago

      At the risk of being an amateur psychologist, your approach feels like that of a front-end developer used to a forgiving programming model with the equivalent of the old BASIC statement ON ERROR RESUME NEXT.

      Server-side APIs, and especially authentication APIs, tend towards the “fail fast” approach. When APIs are accidentally misused, this is treated either as a compiler error or a deliberate crash to let the developer know. Silent failures are verboten for entire categories of circumstances.

      There’s a gradient of: silent success, silent failure, error codes you can ignore, exceptions you can’t, runtime panic, and compilation error.

      That you can’t even tell the qualitative difference between the last half of that list is why I’m thinking you’re primarily a JavaScript programmer, where only the first two in the list exist for the most part.

  • timhigins 7 hours ago

    The title of this post really doesn’t match the core message/thesis, which is a disappointing trend in many recent articles.

  • lostdog 16 hours ago

    A lot of the "science" we do is experimenting on bunches of humans, giving them surveys, and treating the result as objective. In how many places could we do much better by surveying a specific AI?

    It may not be objective, but at least it's consistent, and it reflects something about the default human position.

    For example, there are no good ways of measuring the amount of technical debt in a codebase. It's such a fuzzy question that only subjective measures work. But what if we show the AI one file at a time, ask "Rate, 1-10, the comprehensibility, complexity, and malleability of this code," and then average across the codebase? Then we get a measure of tech debt, which we can compare over time to see if it's rising or falling. The AI makes subjective measurements consistent.
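
    A minimal Swift sketch of that loop; askModel is a hypothetical stand-in for whatever LLM API you would actually call:

      import Foundation

      // Hypothetical stand-in: send a prompt to an LLM and return its reply.
      func askModel(_ prompt: String) -> String { "7" }

      func techDebtScore(forSwiftFilesIn directory: URL) -> Double? {
          let files = (try? FileManager.default.contentsOfDirectory(
              at: directory, includingPropertiesForKeys: nil)) ?? []
          var scores: [Double] = []
          for file in files where file.pathExtension == "swift" {
              guard let source = try? String(contentsOf: file, encoding: .utf8) else { continue }
              let prompt = """
                  Rate, 1-10, the comprehensibility, complexity, and malleability of this code.
                  Reply with a single number.

                  \(source)
                  """
              if let score = Double(askModel(prompt).trimmingCharacters(in: .whitespacesAndNewlines)) {
                  scores.append(score)
              }
          }
          guard !scores.isEmpty else { return nil }
          // Average across the codebase; track this number over time.
          return scores.reduce(0, +) / Double(scores.count)
      }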

    This essay gives such a cool new idea, while only scratching the surface.

    • delusional 16 hours ago

      > it reflects something about the default human position

      No it doesn't. Nothing that comes out of an LLM reflects anything except the corpus it was trained on and the sampling method used. That's definitionally true, since those are the very things it is a product of.

      You get NO subjective or objective insight from asking the AI about "technical debt"; you only get an opaque statistical metric that you can't explain.

      • BriggyDwiggs42 15 hours ago

        If you knew that the model never changed it might be very helpful, but most of the big providers constantly mess with their models.

        • cwillu 15 hours ago

          Even if you used a local copy of a model, it would still just be a semi-quantitative version of “everyone knows ‹thing-you-don't-have-a-grounded-argument-for›”

        • layer8 15 hours ago

          Their performance also varies depending on load (concurrent users).

          • BriggyDwiggs42 14 hours ago

            Dear god does it really? That’s very funny.

            • wiseowise 3 hours ago

              Why are you surprised? It’s a computational thing, after all.

  • elktown 13 hours ago

    I think this is an advertisement for an upcoming product. Sure, join the AI gold rush, but at least be transparent about it.

    • falcor84 12 hours ago

      Even if he does have some aspiration to make money by operationalizing this (which I didn't sense), what Armin describes there is something you could implement a basic version of yourself in under an hour.

      • elktown 12 hours ago

        > which I didn't sense that he does

        I'd take a wager.

        • the_mitsuhiko 12 hours ago

          If your wager is that I will build an AI code quality measuring tool then you will lose it. I'm not advertising anything here, I'm just playing with things.

          • elktown 11 hours ago

            > code quality measuring tool

            I didn't, just an AI tool in general.