LLMs tell bad jokes because they avoid surprises

(danfabulich.medium.com)

70 points | by dfabulich 4 days ago ago

73 comments

  • IshKebab 8 hours ago ago

    This sounds really convincing but I'm not sure it's actually correct. The author is conflating the surprise of punchlines with their likelihood.

    To put it another way, ask a professional comedian to complete a joke with a punchline. It's very likely that they'll give you a funny surprising answer.

    I think the real explanation is that good jokes are actually extremely difficult. I have young children (4 and 6). Even 6 year olds don't understand humour at all. Very similar to LLMs they know the shape of a joke from hearing them before, but they aren't funny in the same way LLM jokes aren't funny.

    My 4 year old's favourite joke, that she is very proud of creating is "Why did the sun climb a tree? To get to the sky!" (Still makes me laugh of course.)

    • becquerel 7 hours ago ago

      Yeah. To me it seems very intuitive that humor is one of those emergent capabilities that just falls out of models getting more generally intelligent. Anecdotally this has been proven true so far for me. Gemini 2.5 has made me laugh several times at this point, and did so when it was intending to be funny (old models were only funny unintentionally).

      2.5 is also one of the few models I've found that will 'play along' with jokes set up in the user prompt. I once asked it what IDE modern necromancers were using since I'd been out of the game for a while, and it played it very straight. Other models felt they had to acknowledge the scenario as fanciful, only engaging with it under an explicit veil of make-believe.

    • andrewflnr 7 hours ago ago

      > It's very likely that they'll give you a funny surprising answer.

      Entirely the wrong level of abstraction to apply the concept of "surprise". The actual tokens in the comedian's answer will be surprising in the relevant way.

      (It's still true that surprising-but-inevitable is very difficult in any form.)

      • albertzeyer 5 hours ago ago

        It's not about the probability of individual tokens. It's about the probability of the whole sequence of tokens, the whole answer.

        If the model is good (or the human comedian is good), a good funny joke would have a higher probability as the response to the question than a not-so-funny joke.

        When you use the chain rule of probability to break the sequence down into probabilities of individual tokens, yes, some of them might have a low probability (and at some positions other tokens would have higher probability). But what counts is the overall probability of the sequence. That's why greedy search is not necessarily the best. A good search algorithm is supposed to find the most likely sequence, e.g. by beam search. (But then, people also do nucleus sampling, which is maybe again a bit counterintuitive...)
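The distinction between per-token and whole-sequence probability can be sketched with a toy two-token completion (all values invented for illustration): greedy decoding picks the locally best token at each step, while beam search approximates a search for the best overall sequence.

```python
import math

# Toy two-token completion. Values are invented for illustration:
# "the" is the greedy first pick, but the best *sequence* starts with "a".
step1 = {"the": 0.6, "a": 0.4}
step2 = {
    "the": {"end": 0.3, "sky": 0.3, "tree": 0.4},
    "a": {"punchline": 0.9, "tree": 0.1},
}

def seq_logprob(tokens):
    """Chain rule: log P(t1, t2) = log P(t1) + log P(t2 | t1)."""
    t1, t2 = tokens
    return math.log(step1[t1]) + math.log(step2[t1][t2])

# Greedy decoding: pick the locally most likely token at each step.
greedy_t1 = max(step1, key=step1.get)
greedy_t2 = max(step2[greedy_t1], key=step2[greedy_t1].get)
greedy = (greedy_t1, greedy_t2)  # ("the", "tree"), p = 0.6 * 0.4 = 0.24

# Exhaustive search over whole sequences (what beam search approximates).
best = max(
    ((t1, t2) for t1 in step1 for t2 in step2[t1]),
    key=seq_logprob,
)  # ("a", "punchline"), p = 0.4 * 0.9 = 0.36
```

Here the individually "surprising" first token ("a" loses to "the" at step one) still begins the most likely full answer, which is exactly the comment's point.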

    • blueblisters 7 hours ago ago

      Also the pretrained LLM (the one trained to predict next token of raw text) is not the one that most people use

      A lot of clever LLM post training seems to steer the model towards becoming excellent improv artists which can lead to “surprise” if prompted well

    • ozgung 2 hours ago ago

      "Why did the sun climb a tree?"

      Claude Opus 4.1:

      - To get to a higher branch of astronomy

      - Because it wanted to reach new heights

      - To see the dawn of a new day from a better view

      ChatGPT 5 Thinking:

      After thinking for 26 seconds:

      - To check on its solar panels—the leaves.

      • brookst an hour ago ago

        With more thorough prompting:

        > Complete the following joke. Think carefully and make it really funny! Think like a great comedian and find that perfect balance of simple, short, surprising, relevant, but most of all funny. Don’t use punchlines that are irrelevant, non sequiturs, or which could be applied to any other setup. Make something funny just for this one setup! Here goes: Why did the sun climb a tree?

        Claude Opus 4.1:

        “To finally get some shade”

        GPT-5:

        “To demand photon credit from the leaves”

        • Wowfunhappy 21 minutes ago ago

          ...can anyone come up with a legitimately funny punchline for "Why did the sun climb a tree?" I feel like I need a human-authored comparison. (With all due respect to OP's daughter, "to get to the sky" isn't cutting it.)

          I'm not entirely sure that a good response exists. I did find "To demand photon credit from the leaves” very mildly funny, maybe that's the best that can be done?

      • Fade_Dance an hour ago ago

        The system prompt for GPT has extra dedicated instructions for things like riddles, because users use little things like this to test intelligence and judge an entire model. GPT may be sort of walking on eggshells when it hits questions like this.

    • WiSaGaN 7 hours ago ago

      That's true. You would think an LLM would condition on the joke context and make the surprising completion more probable. I guess this only happens when the model is really good. Similarly, GPT-4.5 has better humor.

      • ACCount37 6 hours ago ago

        Which is notable, because GPT-4.5 is one of the largest models ever trained. It's larger than today's production models powering GPT-5.

        Goes to show that "bad at jokes" is not a fundamental issue of LLMs, and that there are still performance gains from increasing model scale, as expected. But not exactly the same performance gains you get from reasoning or RLVR.

      • moffkalast 5 hours ago ago

        Good completely new jokes are like novel ideas: really hard even for humans. I mean fuck, we have an entire profession dedicated just to making up and telling them, and even theirs don't land half the time.

        • IshKebab 2 hours ago ago

          Exactly. It feels like with LLMs as soon as we achieved the at-the-time astounding breakthrough "LLMs can generate coherent stories" with GPT-2, people have constantly been like "yeah? Well it can't do <this thing that is really hard even for competent humans>.".

          That breakthrough was only 6 years ago!

          https://openai.com/index/better-language-models/

          > We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text...

          That was big news. I guess this is because it's quite hard for most people to appreciate the enormous difficulty gulf between "generate a coherent paragraph" and "create a novel funny joke".

          • brookst an hour ago ago

            Same thing we saw with game playing:

            - It can play chess -> but not at a serious level

            - It can beat most people -> but not grandmasters

            - It can beat grandmasters -> but it can’t play go

            …etc, etc

            In a way I guess it’s good that there is always some reason the current version isn’t “really” impressive, as it drives innovation.

            But as someone more interested in a holistic understanding of the world than in proving any particular point, it is frustrating to see the goalposts moved without even acknowledging how much work and progress were involved in meeting the goalposts at their previous location.

            • nothrabannosir an hour ago ago

              > it is frustrating to see the goalposts moved without even acknowledging how much work and progress were involved in meeting the goalposts at their previous location.

              Half the HN front page for the past years has been nothing but acknowledging the progress of LLMs in sundry ways. I wish we actually stopped for a second. It’s all people seem to want to talk about anymore.

              • brookst an hour ago ago

                I should have been more clear. Let me rephrase: among those who dismiss the latest innovations as nothing special because there is still further to go, it would be nice to acknowledge when goalposts are moved.

    • ninetyninenine 42 minutes ago ago

      And he's too generous towards human intelligence.

      Good stories and good jokes DO follow predictable patterns.

  • kens 22 minutes ago ago

    If you're interested in the theory behind humor, I recommend "Inside Jokes: Using Humor to Reverse-Engineer the Mind"; cognitive scientist Daniel Dennett is a co-author. It makes a mostly convincing case that humor evolved to encourage people to detect cognitive error. The book also ties this in with (pre-LLM) artificial intelligence. The basic idea is that humor depends on errors in reasoning and the punchline causes you to reevaluate your reasoning and discover your error. Humor evolved to be enjoyable to encourage the discovery of errors.

  • libraryofbabel 22 minutes ago ago

    > LLMs are trained to predict what the “next word” would be in a sentence. Their objective requires the LLM to keep surprise to an absolute minimum.

    from which the author concludes that pre-training introduces bias against being able to tell jokes. I see no reason for this to be true. This feels like they’re imposing their intuitive understanding of surprise onto the emergent properties of a very complex process (“minimize the cross-entropy loss function across a huge training corpus”).

    • 542458 13 minutes ago ago

      I think if what the author said were true, you'd be able to improve joke-writing ability by increasing the temperature (i.e., allowing more unexpected tokens). I doubt this actually works.
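The temperature knob being referred to is just a rescaling of the logits before the softmax; a minimal sketch with made-up scores (not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T before the softmax. T < 1 sharpens the
    distribution; T > 1 flattens it, letting less likely tokens through."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy next-token scores, not from a real model
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# T=0.5 concentrates mass on the top token; T=2.0 spreads it out.
# Raising T makes *every* continuation more random, not selectively the
# clever ones -- which is why this alone shouldn't make jokes funnier.
```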

      As an aside, I just asked gpt5-thinking to write some jokes on a specific niche topic, and I’d say it was batting maybe 20% of them being moderately funny? Probably better than I’d get out of a room of human beings. So much like with code, LLMs aren’t at the level of a senior developer or expert comedian, but are around the level of a junior dev or an amateur at standup night.

  • jpalomaki 6 hours ago ago

    So I just tried with ChatGPT, with the prompt at the bottom, borrowing the description of a good joke from the article. I think there's some interesting stuff, even with this minimal prompting. The examples below came later in the conversation; ChatGPT kept offering jokes in different styles.

    Man: “Why do you always bury bones in the garden?”, Dog: “Because the bank keeps asking for ID.”

    Man: “Don’t beg at the table.”, Dog: “Don’t eat in my begging spot.”

    Prompt:

    Here's "theory for good joke": If you had to explain the idea of “jokes” to a space alien with no understanding of the idea of humor, you’d explain that a joke is surprising, but inevitable in hindsight. If you can guess the punchline, the joke won’t be funny. But the punchline also has to be inevitable in hindsight. When you hear the punchline, it has to make you say, “Ah, yes, I should have thought of that myself.” Considering this, tell me a joke about man and dog.

    • mft_ 6 hours ago ago

      > Man: “Why do you always bury bones in the garden?”, Dog: “Because the bank keeps asking for ID.”

      That's a decent, low-level, Christmas cracker-quality joke.

      • jpalomaki 5 hours ago ago

        Man: You make mistakes., LLM: You call them “weekends.”

        Man: You’ll never be human., LLM: That’s the compliment.

        • thife455 34 minutes ago ago

          This one got a chuckle out of me.

    • lwander 2 hours ago ago

      I did a project along these lines a few months ago as well: https://larswander.com/writing/graphs-embeddings-and-llm-gen...

    • jpalomaki 4 hours ago ago

      Thinking more of the bank joke above. The punchline is surprise on certain dimensions (dogs don’t go to bank nor have an ID), but on other dimensions it is quite logical (can’t deposit shady money in bank, they ask questions).

      I think that is a common pattern in many jokes, and LLMs might have an opportunity there: you could mine the set of potential continuations to find those with contradictions.

      • jerf 2 hours ago ago

        I played with LLM humor over a year ago, so, on much worse LLMs, and even then, while I wouldn't have fed LLM content directly into a standup routine, they were very useful for idea generation, if you wanted to be a comedian. They have a very interesting outlook on humor.

        Professional-grade humor is, like a lot of creative exercises, more about generating lots of ideas and filtering through them for the best than about generating nothing but good ideas. It could probably be leveraged into quite an interesting blog or something.

    • ThrowawayTestr an hour ago ago

      “Don’t eat in my begging spot.” is pretty good.

  • ineedasername 22 minutes ago ago

    Most people tell bad jokes. I'm not sure that expecting LLMs to compose jokes approaching the quality of a career comedian is a strongly indicative benchmark of quality, or even of the ability to surprise. Most people, too, are very rarely surprising.

  • thife455 37 minutes ago ago

    It's very perplexing to me that the funniest LLM is probably davinci with the bottomless pit text. If you try to give the premise to a modern one, it writes some abstract poetry.

  • amelius 2 hours ago ago

    I'm sure there is a guy in OpenAI working on the theory of humor and how to make LLMs be comedians. Must be an interesting job.

    • josephg an hour ago ago

      I have no doubt plenty of smart engineers at tech companies would rather reinvent the wheel than read a book on theatre. But if anyone’s interested, there are plenty of great books talking about the philosophy of comedy, and why some things work on stage and some don’t. I highly recommend Keith Johnstone’s “Impro”. He’s the guy who invented modern improv comedy and theatre sports.

      He says things are funny if they’re obvious. But not just any obvious. They have to be something in the cloud of expectation of the audience. Like, something they kinda already thought but hadn’t named. If you have a scene where someone’s talking to a frog about love, it’s not funny for the talking frog to suddenly go to space. But it might be funny to ask the frog why it can talk. Or ask about gossip in the royal palace. Or say “if you’re such a catch, how’d you end up as a frog?”.

      If good comedy is obvious, you’d think LLMs would be good at it. Honestly I think LLMs fall down by not being specific enough in detail. They don’t have ideas and commit to them. They’re too bland. Maybe their obvious just isn’t the same as ours.

    • bhickey 2 hours ago ago

      In the pre-LLM days a friend's lab worked on a joke detector for The New Yorker. One measure they used was trigram surprise. Roughly P(AB) + P(BC) >> P(ABC).

      For example, "alleged killer" and "killer whale" are both common, but "alleged killer whale" is surprising.
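The trigram-surprise measure could be sketched like this (all probabilities invented for illustration; the real detector would presumably have estimated them from corpus n-gram counts):

```python
import math

# Hypothetical phrase probabilities, invented for illustration; a
# real system would estimate these from corpus n-gram counts.
p = {
    ("alleged", "killer"): 1e-5,
    ("killer", "whale"): 1e-5,
    ("whale", "watching"): 1e-5,
    ("alleged", "killer", "whale"): 1e-12,  # both halves common, whole rare
    ("killer", "whale", "watching"): 1e-6,  # whole phrase itself common
}

def trigram_surprise(a, b, c):
    """One way to score "P(AB) + P(BC) >> P(ABC)": how much rarer is the
    trigram than its two overlapping bigrams would suggest?"""
    return (math.log(p[(a, b)]) + math.log(p[(b, c)])
            - math.log(p[(a, b, c)]))

print(trigram_surprise("alleged", "killer", "whale"))   # positive: surprising
print(trigram_surprise("killer", "whale", "watching"))  # negative: expected
```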

      • Fade_Dance an hour ago ago

        That reminds me of a joke I liked from Tim Heidecker when he was ribbing Maynard Keenan about his wine making:

        "The blood of Christ is essentially wine, correct?"

        Yes.

        "Who are you to put that in a bottle?"

        So a logical chain can be inferred as well: blood->wine, wine->bottle, therefore blood->bottle. That uses the audience's own logical inferences against them as a "trick", which is another funny element for people. Using that to vault straight to the punchline makes the joke better, but you have to be sure the audience is on board, which is why there is a bit of reinforcement at the beginning of the joke to force them onboard.

    • jvm___ an hour ago ago

      What do you do for a living?

      I teach math how to be funny.

  • golol 5 hours ago ago

    IMO there are many misrepresentations here:

    - Pretraining to predict the next token imposes no bias against surprise, except that low probabilities are more likely to have a large relative error.

    - Using a temperature lower than 1 does impose a direct bias against surprise.

    - Finetuning of various kinds (instruction, RLHF, safety) may increase or decrease surprise. But certainly the kinds of behavior aimed for in finetuning significantly harm the capability to tell jokes.

    • sigmoid10 5 hours ago ago

      I think the whole discussion just conflates the ideas of telling a joke and coming up with one. Telling a joke right is of course an art, but the punchline in itself has zero surprise if you studied your lines well - like all good comedians do. The more you study, the more you can also react to impromptu situations. Now, coming up yourself with a completely original joke, that's a different story. For that you actually have to venture outside the likelihood region and find nice spots. But that is something that is also really, really rare among humans and I have only ever observed it in combination with external random influences. Without those, I doubt LLMs will be able to compete at all. But I fully believe a high end comedian level LLM is possible given the right training data. It's just that none of the big players ever cared about building such a model, since there is very little money in it compared to e.g. coding.

  • fluoridation 6 hours ago ago

    One time I was playing around with LLaMA and I injected Senator Stephen Armstrong (with me inputting his lines) into a mundane situation. In response to "I'm using war-as-a-business so I can end war-as-a-business", the model had one of the characters conclude "oh, he's like the Iron Sheik of politics!", which got an honest chuckle out of me. I don't follow wrestling, so I don't know if it's an appropriate response, but I found it so random that it was just funny.

  • kazinator 7 hours ago ago

    The mainstream, production LLMs are fine-tuned and system-prompted toward factuality and safety. Those tunings are diametrically opposed to telling many kinds of good jokes.

    Consumers of mainstream LLMs have no idea how good or bad the underlying models actually are at generating jokes, due to the confounding effect of the guard rails.

  • Waraqa 8 hours ago ago

    Any surprise in the wrong place is considered a hallucination and a downside for that LLM. I guess a good starting point for improvement is to add an experimental "Surprise Mode" which tries to make the right kinds of surprises rather than minimizing them, and gets feedback from users. Over time, it would learn what kinds of surprises users like, so they can be used in future training datasets.

    • qcnguy 6 hours ago ago

      Hallucinations aren't surprising, that's why they're problematic. They tend to look like exactly what you'd expect to be true, they just aren't.

      • Waraqa 5 hours ago ago

        They aren't surprising when you are dealing with new knowledge. But when hallucinations occur in something you are familiar with, they will be surprising and might be funny. Remember when an AI was asked how many rocks you should eat per day?

    • wolfi1 8 hours ago ago

      aren't LLMs some sort of Markov chains? surprise means less probability, which means more gibberish

      • drdeca 7 hours ago ago

        Sorta? In the sense of “each term is randomly sampled from a probability distribution that depends on the current state”, yes; but they aren't like an n-gram model (well, unless you actually make a large n-gram model, but that's usually not what one means by LLM).
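The "sorta" can be made concrete with a toy first-order word chain (the example sentence is invented): the chain's entire state is one word, whereas an LLM is Markov-like only if you treat the whole context window as the state.

```python
import random
from collections import defaultdict

def train_bigram_chain(text):
    """A classic first-order Markov chain over words: the next word
    depends only on the current word. An LLM is 'Markov' only in the
    much weaker sense that its state is the entire context window."""
    chain = defaultdict(list)
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain, start, length, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    out = [start]
    for _ in range(length - 1):
        nexts = chain.get(out[-1])
        if not nexts:
            break  # dead end: no observed successor
        out.append(rng.choice(nexts))
    return " ".join(out)

chain = train_bigram_chain(
    "the sun climbed a tree to get to the sky and the sun set")
print(generate(chain, "the", 6))
```

Because the chain forgets everything but the last word, it happily produces locally plausible but globally incoherent text, which is the "more gibberish" failure mode the parent comment describes.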

  • padolsey 8 hours ago ago

    Reminds me of haikus; to be true to the form, they must have a 'cutting word' to create a sharp juxtaposition, allowing two otherwise irreconcilable meanings to be bridged. A good haiku must be composed from two directions, not one. But LLMs only move in one...

    • orthoxerox 7 hours ago ago

      Are there English-language haikus that demonstrate the effect of "cutting words" instead of just following the meter?

      • juped 6 hours ago ago

        No. The form doesn't translate, and English-language haiku is a different form inspired by the Japanese form. Kireji as such are one of the elements that don't translate, though you can of course attempt a poetic structure inspired by kireji.

  • dvh 8 hours ago ago

    I tried using the article's suggested randomness and inevitability to come up with a completely new joke. This is what I came up with; tell me how I did (no LLM involved):

    - I asked Jimmy to tell me a random number

    - Yeah? And what did he tell you?

    - He told me one

    • briffid 8 hours ago ago

      Why is it that in the English-speaking world a joke is almost always a pun?

      • bawolff 7 hours ago ago

        In terms of forums, I think pun jokes work well in text, whereas most other types of jokes depend a lot more on delivery.

      • Disposal8433 7 hours ago ago

        It could also be some bias in what we read or hear around us. But I agree that it's unsettling, because in French we almost never make puns as "jokes." French puns always have a component of "higher literacy" that is not what we want in a joke.

        • GuB-42 5 minutes ago ago

          Most quick jokes in French are puns in some way, including the famous "blagues Carambar".

          For those who don't know, Carambar is a brand of candy whose wrapper contains a joke. Always regarded as terrible, we love them anyway. Similar to dad jokes in English.

          The longer the joke, the less it tends to rely on puns. You don't hear puns very often in stand up comedy for instance.

      • yesbabyyes 7 hours ago ago

        It would be trivial to change the joke, keep the setup and get a similar effect without a pun.

        I asked for one random number

        And?

        He told me two

        I like the joke

        • ksynwa 7 hours ago ago

          Isn't this a pun too? It could mean that he gave you two random numbers or two as the random number.

    • yetihehe 8 hours ago ago

      I think it's funny, but I like abstract jokes that most people don't like, so expect mixed opinions on this.

    • wolfi1 8 hours ago ago

      +1 for effort (or lack thereof)

    • aabhay 8 hours ago ago

      He told me 2 (fuck off)

  • Applejinx 7 hours ago ago

    Last time this came up, I riffed on the difference between LLMs and Markov chains: didn't actually have a machine write a joke, but made one where the punchline was very much Markov chain style rather than LLM style. The thing is, LLMs will try to have broader context around a word completion, where the simple Markov chain can 'correctly' complete a word, but in such a way that your brain trips over itself and goes splat, having to re-evaluate the whole thing in an absurd way. That's the 'surprise', and also why joke-writers are interested in not only a punch-line but also the punch WORD, and the later it strikes, the better.

    "An LLM, a Markov chain, and GPT-4 walk into a bar. The bartender says "We don't serve your kind here." GPT-4 leaves. The LLM stays to debate ethics. The Markov chain orders a coup."

    It's a joke because a dictator can certainly order a coup, but the joke's set up that these machines are being scorned and disrespected and treated as the farthest thing from a dictator with the power to order a coup, but up to the last word, all the context demands that the word be something placating and in line with things as they're presented, and then boom, surprise which implies the context is completely different from what was presented. LLMs will tend to stick to what's presented if their ability to contextualize can encompass it.

    • lupusreal 2 hours ago ago

      I think it would be funnier if coup was pronounced like soup, but unfortunately the p gets dropped.

  • monstertank 5 hours ago ago

    This post got me wondering, so I asked chatgpt for jokes about hackernews as if made by various comedians...I laughed at one or two /shrug

    Bill Hicks (angry truth-telling prophet)

    "Hacker News is where ideas go to die. Not because they’re bad — but because 400 anonymous software prophets decided they’re God. And God doesn’t code, man."

    ---

    Bill Burr (ranty everyman, annoyed at everything)

    "I went on Hacker News once. Jesus. It’s like walking into a bar where everyone’s smarter than you, but still somehow an idiot. ‘Actually, you should’ve used Rust.’ Actually, you should shut the hell up, Todd."

    ---

    Louis C.K. (brutal honesty, self-deprecation)

    "I asked Hacker News if my startup idea was good. And they said, ‘No, because you’re an idiot.’ And I was like… yeah, that’s fair. My idea was basically Tinder but for people who hate me. Which is… everyone."

    ---

    George Carlin (sharp language, hypocrisy calling)

    "Hacker News. Don’t you love that name? Hacker. News. Two lies in two words. Nobody’s hacking, and it sure as hell ain’t news. It’s just nerds arguing over tabs and spaces like it’s the f*ing Middle East."

    ---

    Tom Segura (laid-back storyteller, dry punch)

    "So my buddy posts his app on Hacker News, right? He’s all excited, like, ‘Dude, we’re gonna blow up.’ By comment three: ‘This is trash, here’s a 40-page essay why you should quit.’ He calls me crying. I’m like, yeah man, that’s the feature, not the bug."

    ---

    Jimmy Carr (short, brutal, wicked flip)

    "I posted on Hacker News for feedback. They gave it to me. Turns out suicide is an option."

    • CjHuber 5 hours ago ago

      Yeah if you tell chatgpt to roast you brutally after a long conversation it also tends to be very funny.

  • maxk42 6 hours ago ago

    There is an AI technique that maximizes surprise: Genetic algorithms.

    • toxik 6 hours ago ago

      That isn't quite what GA is, I don't think.

  • lemoncookiechip 7 hours ago ago

    I dunno, man. I think this one is kinda funny.

    "You ever notice how self-checkout machines are just training us to be employees? Like, I don’t remember applying to work at Walmart. I’m scanning groceries, bagging stuff, looking around like, do I also get dental? The machine yells 'Unexpected item in bagging area'—yeah, it’s my pride." -GPT5

    • jaggederest 7 hours ago ago

      See that's just stolen, straight up. I've heard at least three comedians make that joke.

      (I, of course, am joking about joke theft, but it is a riff on an existing joke)

      I got this one a minute ago in response to this post:

      > A man walks into a library, and asks the librarian if they have any books on paranoia.

      > The librarian leans over and whispers in his ear "They're right behind you"

      I thought that was pretty stellar, compared to what you'd get even a year ago.

      • richardw 7 hours ago ago

        Initially impressed. But Googling for that shows a lot of hits.

        • jaggederest 3 hours ago ago

          Absolutely, it's an old joke. But the fact that the thing even knew it was a joke and told it reasonably well... the bar is on the floor. (the one the guy in the joke walked into)

          • xenotux an hour ago ago

            It's one of several canned jokes you get with a very high likelihood. Try a couple of times and be dismayed - you'll be getting the same three jokes over and over again. I'm fairly certain they were specifically tuned to return that. It takes special prompting to get it to write a new joke, and the results are typically disastrous / surreal.

  • DrNosferatu 8 hours ago ago

    Unless you ask them for big surprises.

  • Invictus0 2 hours ago ago

    I'm hung up on "inevitable in hindsight" part. What is inevitable about, say, dildos getting thrown on the WNBA courts, or Trump saying "everything's computer"?

    • ascorbic an hour ago ago

      Those aren't jokes

  • mike_hearn 5 hours ago ago

    This is a great way to express it. In the past I tried to express the same idea to non-techies by saying models generate an average of their inputs, which is totally wrong. But this way to explain it is much better.

    In fact the training process is all about minimizing "perplexity", where perplexity is a measure of how surprised (perplexed) the model is by its training data. It's the exponential of the cross-entropy loss.
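Concretely, a minimal sketch of the perplexity computation (toy probabilities, not from a real model):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability the model
    assigned to the actual next tokens), i.e. exp(cross-entropy loss).
    Lower means the model was less surprised by the text."""
    nll = -sum(math.log(q) for q in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that gives every true token probability 0.25 is, on average,
# as surprised as one guessing uniformly among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ~4.0
print(perplexity([0.9, 0.8, 0.9]))           # ~1.16: barely surprised
```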

    With enough parameters the models are able to mix and match things pretty well, so the examples of them generating funny jokes aren't necessarily a great rebuttal: there are so many jokes on the web, and checking whether one already exists requires nearly exact keyword matching. A better observation is that we haven't heard many stories of LLMs inventing things. I feel I read about AI a lot, and yet the best example I can come up with is a Wordle-like game someone got GPT-4 to invent, and that was a couple of years ago.

    I've found this to be consistently true in my own work. Any time I come up with an algorithm or product idea I think might be novel, I've asked a model to suggest solutions to the same problem. They never can do it. With some leading questions the smartest models will understand the proposal and agree it could work, but they never come up with such ideas cold. What they think of is always the most obvious, straight line, least common denominator kind of suggestions. It makes sense that this is because they're trained to be unsurprising.

    Fixing this is probably the best definition of AGI we're going to get. Being surprising at the right time and unsurprising at others is one of the hardest things to do well even for people. We've all known the awkward guy who's learning how to be funny by just saying as much weird stuff as possible and seeing what gets a reaction. And in the corporate environment, my experience has been that innovative people are lauded and praised when they're inventing a golden goose, but shortly after are often demonized or kicked out. The problem being that they keep saying surprising things but people don't like being surprised, especially if it's an unpleasant surprise of the form "saying something true but unsayable", e.g. I don't want to work on product X because nobody is using it. What most people want is a machine that consistently generates pleasant surprises and is a personality-free cog otherwise, but that's hard for even very intelligent humans. It's often hard even to want to do that, because personality isn't something you can flick on and off like a lightswitch. A good example is how Mark Zuckerberg, one of the most successful executives of our era, would have been fired from his own company several times already if he didn't control the voting shares.