Markov chains are the original language models

(elijahpotter.dev)

435 points | by chilipepperhott 5 days ago ago

162 comments

  • AnotherGoodName a day ago ago

    The problem is the linear nature of Markov chains. Sure, they can branch, but after an observation you are absolutely at a new state. A goes to B goes to C, etc. A classic problem for understanding why this is an issue is feeding in a 2D bitmap where the patterns are vertical but you’re passing in data left to right, which Markov chains can’t handle, since they are navigating exclusively on the current left-to-right input. They miss the patterns completely. Similar things happen with language. Language is not linear, and context from a few sentences ago should change probabilities in the current sequence of characters. The attention mechanism is the best we have for this, and Markov chains struggle beyond stringing together a few syllables.

    I have played with Markov chains a lot. I tried having skip states and such but ultimately you’re always pushed towards doing something similar to the attention mechanism to handle context.

    • t_mann a day ago ago

      You can model multiple-hop dependencies as a Markov chain by just blowing up the state space as a Cartesian product. Not that that would necessarily make sense in practice, but in theory Markov chains have enormous expressive power.
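
      For a concrete (if impractical) illustration, here is a minimal Python sketch of that construction, with toy data of my own: an order-2 chain over words becomes an order-1 chain whose states are word pairs, i.e. elements of the Cartesian product vocabulary × vocabulary.

          import random

          # Order-2 transitions: P(next word | last two words).
          order2 = {
              ("the", "cat"): {"sat": 0.9, "ran": 0.1},
              ("cat", "sat"): {"down": 1.0},
          }

          # The same process as an order-1 chain: each state is a *pair* of words,
          # so the next state depends only on the current state (Markov property).
          # A vocabulary of V words needs up to V^2 such states - hence the blowup.
          order1 = {
              state: {(state[1], nxt): p for nxt, p in dist.items()}
              for state, dist in order2.items()
          }

          state = ("the", "cat")
          options, weights = zip(*order1[state].items())
          state = random.choices(options, weights)[0]   # e.g. ("cat", "sat")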

      • TeMPOraL a day ago ago

        > Not that that would necessarily make sense in practice, but in theory Markov chains have enormous expressive power.

        Right - but this feels like being half-way between transformers and lookup tables; the latter have enormous expressive power too, as long as you're willing to handle even more state. I'm curious what else could be put on that spectrum, and where. Maybe there's something weaker than transformers, but good enough for practical applications, while being more resource-efficient?

        Related thought: understanding and compression seem to be fundamentally the same thing.

        • joquarky a day ago ago

          > Maybe there's something weaker than transformers, but good enough for practical applications, while being more resource-efficient?

          I wish! Gboard is terrible at predicting the correct form of a word.

          It generally knows the best word, but it doesn't get from context that (e.g.) a gerund form of the verb is the most appropriate next choice.

        • impulsivepuppet 16 hours ago ago

          Related and likely inspired by the related thought: https://www.mattmahoney.net/dc/

        • joe_the_user 20 hours ago ago

          But no, as another poster comments, the state-space blowup is much worse than transformers, for much less capability [1]. Transformers have more kinds of statefulness, while Markov chains seem simpler because they just multiply the number of states - but that state multiplication makes them completely impractical. They aren't flexible on a per-state basis, just the opposite.

          Basically, for most forms of complexity, Markov chains are a strictly bad model. Perhaps if there were a way to "virtualize" the multiplication of states, you could do something reasonable. Idle speculation though.

          [1] https://news.ycombinator.com/item?id=45354488

        • jgalt212 12 hours ago ago

          Most of human life is lookup-table-derived action.

          • econ 12 hours ago ago

            I notice that people with poor lookup table hardware need to think a lot more. The public part of the table is mostly populated with their thoughts.

            To quote a friend of mine: If I can't just trust the status quo I would have to question everything????

            • TeMPOraL 2 hours ago ago

              They're right, of course. You have to pick your battles. There's an art to that - and rules of that art are part of the lookup table too!

              #ItsSoMetaEvenThisAcronym

        • taneq 17 hours ago ago

          > Related thought: understanding and compression seem to be fundamentally the same thing.

          Both are fundamentally prediction. :)

      • giovannibonetti a day ago ago

        > You can model multiple-hop dependencies as a Markov chain by just blowing up the state space as a Cartesian product.

        Where the state space would be proportional to the token length squared, just like the attention mechanisms we use today?

        • AnotherGoodName a day ago ago

          It blows up at 2^n for Markov chains, actually.

          Eg imagine input of red, followed by 32 bits of randomness, followed by blue, forever. Markov chains would learn that red leads to blue 32 bits later. They’d just need to learn 2^32 states.

          • fooker 18 hours ago ago

            Yep, the leap away from this exponential blowup is what has made LLMs possible.

            A few more leaps and we should eventually get models small enough to get close to information theoretic lower bounds of compression.

      • JohnHaugeland a day ago ago

        > but in theory Markov chains have enormous expressive power.

        as long as you don't care about the quality of what they're expressing. there's a reason they never did anything better than the postmodernism generator.

        putting paint in a cannon has enormous expressive power too, but if you aren't rothko, nobody's going to care

    • roadside_picnic a day ago ago

      > after an observation you are absolutely at a new state

      The essential property of a Markov chain is maintaining the Markov property:

      P(X_n+1 = x_n+1 | X_n = x_n, ..., X_1 = x_1) = P(X_n+1 = x_n+1 | X_n = x_n)

      That is, the future, given the present state, is conditionally independent of the past states.

      It's worth recognizing that decoder-only LLMs (which are most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.

      > The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.

      Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.

      tl;dr most of the LLMs people use are effectively Markov Chains.
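
      As a rough sketch of that framing (the llm and sample functions here are hypothetical stand-ins, not a real API): if the state is the whole context, each generation step is just a Markov transition.

          def markov_step(state: tuple[int, ...]) -> tuple[int, ...]:
              # The next-token distribution is a function of the current state
              # (the full context) alone - how we arrived there is irrelevant.
              probs = llm(state)       # hypothetical: state -> next-token distribution
              token = sample(probs)    # hypothetical: draw one token from it
              return state + (token,)  # the extended context is the new state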

      • kgeist a day ago ago

        What is the conceptual difference between a "past state" and a "present state"? In a typical conversation with an LLM, you have past messages that remain in the KV cache when generating the next token (so as not to recalculate everything). Yet here, everything that came before is reinterpreted as part of the present state. In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain? I can't wrap my head around it. Can you give an example of a non-Markov system?

        • jltsiren a day ago ago

          The difference is arbitrary but fixed. For any given Markov model, there are inputs that can fully fit in the present state and inputs that can't. But for any given input, there are models that can encode it entirely in the present state.

          Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window. But not for the limitations of such predictors in general, as there is always a Markov model large enough to handle any specific input.

        • roadside_picnic a day ago ago

          > In other words, it seems as if you can take any past state and fold it into one large concatenated present state

          This is what n-grams are; even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3-7-grams.
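
          A toy sketch of that (my own illustration, not from the article), with the state being the last two words:

              from collections import defaultdict
              import random

              text = "the cat sat on the mat and the cat ran".split()

              # 3-gram model: the state is the last *two* words, not just the most recent.
              model = defaultdict(list)
              for a, b, c in zip(text, text[1:], text[2:]):
                  model[(a, b)].append(c)

              state, out = ("the", "cat"), ["the", "cat"]
              for _ in range(5):
                  successors = model.get(state)
                  if not successors:            # this 2-word context was never continued
                      break
                  nxt = random.choice(successors)
                  out.append(nxt)
                  state = (state[1], nxt)       # slide the window: a pure Markov transition
              print(" ".join(out))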

          > Can you give an example of a non-Markov system?

          Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.

          • nearbuy 15 hours ago ago

            If you include the encoder outputs as part of the state, then encoder-decoder LLMs are Markovian as well. While in token space, decoder-only LLMs are not Markovian. Anything can be a Markov process depending what state you include. Humans, or even the universe itself are Markovian. I don't see what insight about LLMs you and other commenters are gesturing at.

        • Imnimo a day ago ago

          A common distinction is that whatever you take to be your state space should be fixed-sized. So an LLM has a fixed context window, which might be extremely large, but with a long enough conversation, the oldest tokens will fall out of the context window, and your process will be non-Markovian. But with a large context size, most conversations will be small enough that they fit entirely within the context, and then the process is Markovian.

          So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.

          I think of it as conceptually a little like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist some computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.

          • Imnimo a day ago ago

            Sorry, I realized I didn't quite write what I meant to. I didn't intend to say that LLMs are non-Markovian from a theoretical standpoint. I meant to say that the language generation task is non-Markovian from a theoretical standpoint, because the next word can depend on arbitrarily distant history.

            • vrighter 18 hours ago ago

              it can depend on whatever the oldest thing in the context window is, but not anything older

        • jampekka a day ago ago

          > In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain?

          Essentially, yes. Given a complex enough state, more or less any process is Markovian. Some do require infinite state, which would maybe be stretching the definition a bit. In practice, the Markov formulation may not be a very good analysis perspective if the required state would be very large/complex.

        • fluoridation a day ago ago

          Hmm... For example, say you have a machine that must decide whether a monetary transaction is allowable or not. The state at any given time could be the balances of all accounts, plus some metadata about past states (but not the past states themselves). Then if A has $5 in their account and tries to transfer $5 to B, the contents of the metadata would be the deciding factor of whether the transaction goes through or not. The machine still only operates on the present state, but the present state depends indirectly on past states.

          • kgeist a day ago ago

            But a conversation with an LLM doesn't only depend on the last message; it depends on all previous messages. Sure, the number of messages is limited by the context window, so you can indeed say it's a high-order Markov chain (it can only see N states back), but theoretically, what would an actual non-Markovian LLM look like? Something with an infinite context? It's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.

            • fluoridation a day ago ago

              >what an actual non-Markovian LLM would look like? Something which has an infinite context? It's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.

              Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.

              • Veedrac 16 hours ago ago

                The point being made is equivalent to pointing out that, by strict mathematical definition, there aren't physically realizable distinguishably-non-Markovian processes, because computers are finite objects. Yes, you can talk about Markov chains as referring to everything up to and including actual human brains, but the kind of Markov chains people actually refer to as Markovian in practice are much more specific than that: typically those systems with structurally simple, indexical nodes, but sometimes extending to more complex ideas like RNNs, where the states are at least of finite bandwidth. An LLM not only has continuous-valued nodes, it has nodes whose description length grows proportionally to, and theoretically unboundedly with respect to, the size of its history.

              • imtringued 11 hours ago ago

                >because there are processes that are neither LLMs nor Markov chains.

                I'm not sure what those would be. All dynamical systems (essentially all of classical physics), except maybe quantum mechanics, are Markov chains.

                • fluoridation 9 hours ago ago

                  Well, I gave an example already, so...

            • thaumasiotes a day ago ago

              You appear to be confusing the UI over the LLM with the actual working of the LLM. None of them allow for more than one message to exist. You put in one message and you get one word (well, one token) back.

              • kgeist a day ago ago

                I have an idea of how they work. At the end of the day, it's all just numbers. You say you put in "one message", but in reality it's N input numbers corresponding to the tokens (where N can vary from 1 to any size). Basically N variables. An LLM is one large function which takes N values as input and produces a new output value. So you can view an LLM as y = f(a, b, c, ...)

                What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?

                As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.

                • fluoridation a day ago ago

                  Let's keep it simple and say that the procedure of an LLM is

                    state: list[n];                 // rolling window of the n most recent tokens
                    while (true){
                        new_token = llm(state);     // llm() sees only the current state
                        state.pop_front();          // the oldest token falls out of the window
                        state.push_back(new_token); // the new token becomes part of the state
                    }
                  
                  Any given time that llm() is called, it is blind to previous invocations. The state may contain evidence of them, or it may not. It's a free-form data structure that is sometimes filled with user input and sometimes contains previous llm() outputs, but llm() doesn't care.

                  • kgeist a day ago ago

                    So any pure function is Markovian then? They too only depend on their input and are blind to previous invocations. I’m just trying to understand the practical purpose of calling something a Markov chain beyond small sizes, because if we allow the state to be of any size, I’m not sure what the usefulness is: anything can be called a Markov chain if you concatenate all input data of any size into one blob and interpret it as "one state."

                    • fluoridation a day ago ago

                      >So any pure function is Markovian then?

                      I guess it would be more accurate to say that at its core a Markovian process is based around a pure function. Usually there's some kind of feedback loop around the function that allows the process to evolve in some way and continue producing output.

                      >I’m pretty sure anything can be modeled using pure functions.

                      We're not talking about theoretical models of black boxes, though, we're talking about known mechanisms. A recursive function doesn't become iterative just because it could be thought of as iterative (because eventually the CPU does loop over some sequence of instructions). We still say it's recursive.

                      >I’m not sure what the usefulness is

                      I suppose part of the usefulness is to de-mystify LLMs. If you can understand what Markov chains do, the thought "oh, an LLM is just a much more complex Markov chain" can help in reasoning about them.

                      • drdeca a day ago ago

                        “just”?

                        If you include the entire past history as part of the state, then any stochastic process becomes a Markov process.

                        Any computer program (without network stuff or inputs for a time (assume your computer has a true RNG module)) you could run is a Markov process. Just say that your state space is the space of possible ways the memory of the computer can be.

                        Saying that the LLM is “just” a Markov process seems to be doing something a bit weird with that “just”.

                        • fluoridation a day ago ago

                          >Any computer program (without network stuff or inputs for a time (assume your computer has a true RNG module)) you could run is a Markov process. Just say that your state space is the space of possible ways the memory of the computer can be.

                          No. The entire computer is a Markov process in that situation. An individual process may still remember its past states and therefore not be Markovian. See the example I gave about allowing or blocking transactions.

                          • drdeca a day ago ago

                            By “memory” I was including whatever is on disk, and the values of all the CPU registers, and all the CPU flags, etc.

                            So, yeah, the state of the computer (regarded as an abstract machine, not viewed at the level of the physical hardware)

                            Oh, you mean because I said a particular program, and other things on the computer could interfere with the program. Ok, point. I was imagining a computer that just had a single program running on it (with no separate OS) I guess? But yeah, good point, I certainly didn’t make my idea precise enough (in my own mind or in my words).

                            • fluoridation a day ago ago

                              Yes, I understood what you meant and my answer assumes that the contents of all memory -- volatile and persistent -- constitute the state of the machine.

                              It doesn't matter whether the program is the only thing the computer is doing or whether there are other things running on the same machine. The program considered as a machine unto itself is a conceptually different thing from the underlying hardware. The Markov property is a way to reason about how a process interacts with its state. "I could design a Markov process that's semantically equivalent to a non-Markov process, by just including more state" is a pointless statement. The whole point of the distinction is to reason about the thing you're studying.

      • eigencoder a day ago ago

        Could you help me understand how decoder-only LLMs maintain the Markov property? If you used the same random seed, the input to the model "The cow jumped over the" would not give the same output as just "the", right? So isn't that violating the Markov property?

        • OJFord a day ago ago

          State (in this sense at least) isn't word/token parsing progress; it comprises all the input and any context (which may include the entire chat history, for example).

        • AnotherGoodName a day ago ago

          There would need to be a state specifically for “the cow jumped over the” (and any other relevant context), and states for all the other times ‘the’ is preceded by something.

          This is the limitation I was getting at, btw. In the example I was getting at, if you have an image with solid vertical columns, followed by columns of random static, followed again by solid vertical colors, a Markov chain could eventually learn all patterns that go

          solid -> 32 random bits -> different solid color

          And eventually it would start predicting the different color correctly, based on the solid color before the randomness. It ‘just’ needs a state for every possible random color in between. This is ridiculous in practice, however, since you’d need to learn 2^32 states just for the relationship between those two solid colors alone.

          • thesz a day ago ago

            > It ‘just’ needs a state for every possible random color between.

            You can use skipgrams - prefixes with holes in them.

            Sparse Non-negative Matrix Language Model [1] uses them with great success.

            [1] https://aclanthology.org/Q16-1024/

            The pure n-gram language models would have a hard time computing escape weights for such contexts, but the mixture of probabilities that is used in SNMLM does not need to do that.

            If I may, I've implemented an online per-byte version of SNMLM [2], which allows the use of skipgrams. They make performance worse, but they can be used. My implementation's predictive performance is within a few percent of LSTM's on enwik8.

            [2] https://github.com/thesz/snmlm-per-byte
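
            If it helps, here's a toy illustration (mine, far simpler than SNMLM) of what a skipgram context buys you on the red/noise/blue example from upthread:

                from collections import Counter

                words = "red 1 blue red 2 blue red 3 blue".split()

                # A skipgram context keeps some positions and leaves holes ("_"), so the
                # pattern "red _ -> blue" is learned despite the noisy middle token.
                counts = Counter()
                for i in range(2, len(words)):
                    context = (words[i - 2], "_")     # keep w[i-2], skip w[i-1]
                    counts[(context, words[i])] += 1

                print(counts[(("red", "_"), "blue")]) # 3: one per red/noise/blue triple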

      • mjburgess a day ago ago

        They are Markov chains; anyone saying otherwise doesn't understand what a Markov chain is.

        • lackoftactics a day ago ago

          Wouldn't tool calling or MCP break the finite state?

          • jameshart 20 hours ago ago

            Tool calling is just part of the protocol you're using to decide how to handle states output by your Markov model. Same goes for chat input, for that matter.

            Chat interfaces are like taking a simple Markov generator, but with a rule where you say 'whenever it reaches a state ending in X, hand over decision making about state transitions to a human instead of a random number generator, until they move it to state ending in Y'.

            Tool calling is similar - 'when it reaches state ending in X, send the state data to a tool; use the result to drive a series of state transitions; then start generating again'.
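
            A rough sketch of that protocol view (every name below is hypothetical; real agent loops differ in the details):

                def drive(state):
                    while True:
                        state += sample_next_token(state)      # ordinary Markov-style transition
                        if state.endswith(TOOL_CALL_MARKER):   # "state ending in X": hand off
                            state += run_tool(state)           # tool output drives what comes next
                        elif state.endswith(END_TURN_MARKER):  # hand control to the human
                            state += read_human_input()        # until they yield it back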

        • hk__2 15 hours ago ago

          Authority arguments like "I’m right and if you disagree with me you don’t understand anything" are not exactly helpful.

          • mjburgess 9 hours ago ago

            It's not an authority argument; it's the opposite: a warning about the supposed authority of people who would make that kind of claim.

            The validation of what I'm saying is little more than a few minutes of googling away.

            Some interventions can just be warnings, rather than lectures.

      • rcfox a day ago ago

        Is this just because LLMs don't have state?

        As far as I understand it, as you have a back-and-forth conversation with an LLM, you have to provide the entire history of the conversation plus your new response each time.

        • jampekka a day ago ago

          Stateful models, e.g. RNNs, are Markov models too. Sometimes "Markov chain" is used to refer specifically to models with no hidden state, e.g. (decoder-only) Transformers.

      • red75prime 16 hours ago ago

        > tl;dr most of the LLMs people use are effectively Markov Chains.

        ...when they are used without RAG and tools. x_n belongs to a set of cardinality 65536^100000 or so. Equating an LLM to a Markov chain doesn't let you make any non-trivial predictions about its behavior.

    • nadalishhh 5 hours ago ago

      Excellent point about the fundamental limitation of classical Markov chains' linearity. Your 2D bitmap example perfectly illustrates the key insight: traditional Markov chains are inherently constrained by their sequential, memoryless transitions.

      However, I'd argue that the distinction between classical n-gram Markov chains and modern transformers isn't as binary as it might appear. When we consider that transformers with context windows are mathematically equivalent to very high-order Markov chains (where the "state" encompasses the entire context), we see that the breakthrough wasn't abandoning the Markov property, but rather expanding the state representation exponentially.

      Your observation about inevitably trending toward attention-like mechanisms is particularly insightful. The exponential state explosion you encountered (needing 2^32 states for your example) is precisely why parameterized approaches like transformers became necessary - they provide a tractable way to approximate these massive state spaces through learned representations rather than explicit enumeration.

      The key innovation wasn't escaping Markov chains, but rather finding an efficient way to represent and compute over astronomically large state spaces that would be intractable for classical approaches.

    • felineflock 12 hours ago ago

      Have you tried sequencing the 2D bitmap along a Hilbert space-filling curve before applying the Markov chain?

    • 6r17 a day ago ago

      Would you say it's interesting to explore after spending so much time on them? Do you feel one could put them to pragmatic use in certain contexts, or are they too much of a toy, where most of the time getting a service / coherent LLM would ease the work?

      • AnotherGoodName a day ago ago

        Yes. I think learning them and learning their limitations is the best way to learn neural networks actually.

        Give a class of students an image with horizontal lines, where every second line is a solid color and every other is random static. See how their left-to-right Markov chains do here (they should make ~50% correct predictions).

        Then rotate the image 90 degrees. Have the class observe that a left-to-right Markov chain gets 0% when predicting this (every second pixel being random will do that). What to do? Maybe input both ways and weight towards the best one with a perceptron? Hey, first step to learning a neural network!

        From there you can iterate more and more until you no longer really have Markov chains but instead neural networks with a type of attention mechanism.

    • cuttothechase a day ago ago

      Would having a Markov chain of Markov chains help in this situation? One chain does this when 2D bitmap patterns are vertical, and another one for left to right?

      • AnotherGoodName a day ago ago

        Yes, and then you weight between them with a neural network, and your vertical predictor catches that every second vertical line is solid (while every other vertical line is static, to mess up the horizontal Markov chains). Of course, then someone passes you video, where there’s a 3rd dimension, and you need to customise yet again with that consideration. Or maybe the pattern is in 45-degree diagonal lines, not horizontal or vertical. Better have a Markov chain for that too. What about 10-degree vertical lines? Etc.

        In the end you’re inputting millions of ways there could be a pattern, passing all of those into a neural network, and weighting the chains that make correct predictions more.

        You start to realize that even with all these ways, past context could still influence the current prediction, and what you want is a generator for all the ways there could be a pattern. At this point you're getting into the realm of multilayer neural networks and starting to consider the attention mechanism.

        I don’t want to discourage anyone from learning Markov chains here, btw. It’s just that they have limitations, and those limitations actually make a great learning journey for neural networks: you realize you really need more than an absolute singular state being activated at a time, and then you start to think about how all the states activated in the past might influence the current probabilities (essentially, you then start thinking about the problem the attention mechanism solves).

    • vrighter 18 hours ago ago

      yeah but that state can be as big as you want. say, for example, the state could be a vector of a million tokens...

      • procaryote 17 hours ago ago

        * as big as the context window

  • dfsegoat 17 hours ago ago

    Related:

    KingJamesProgramming [1] - a mashup of Knuth books [2] and the King James Bible with a Markov chain, is still one of my favorite reads for a laugh.

    It was also the first place I was really exposed to probabilistic-based methods for generation of novel content.

    You get gems like:

    "they smote the city with the edge of the sword. 22:20 And one of his main motivations was the high cost of proprietary software houses."

    and

    "37:29 The righteous shall inherit the land, and leave it for an inheritance unto the children of Gad according to the number of steps that is linear in b."

    [1] https://www.tumblr.com/kingjamesprogramming

    [2] https://en.wikipedia.org/wiki/Donald_Knuth

    • foofoo12 15 hours ago ago

      The latest post on KingJamesProgramming is hilarious:

      "then shall they call upon me, but I will not cause any information to be accumulated on the stack."

    • svat 10 hours ago ago

      Just stating for clarity: it is a mashup of the King James Bible and the book Structure and Interpretation of Computer Programs, which is not by Knuth (it is by Harold Abelson and Gerald Jay Sussman with Julie Sussman).

  • chankstein38 a day ago ago

    I once, probably 4-6 years ago, used exports from Slack conversations to train a Markov Chain to recreate a user that was around a lot and then left for a while. I wrote the whole thing in python and wasn't overly well-versed in the statistics and math side but understood the principle. I made a bot and had it join the Slack instance that I administrate and it would interact if you tagged it or if you said things that person always responded to (hardcoded).

    Well, the responses were pretty messed up and not accurate but we all got a good chuckle watching the bot sometimes actually sound like the person amidst a mass of random other things that person always said jumbled together :D

    • vunderba a day ago ago

      I had a similar program designed as my "AWAY" bot that was trained on transcripts of my previous conversations and connected to Skype. At the time (2009) I was living in Taiwan so I would activate it when I went to bed to chat with my friends back in the States given the ~12 hour time difference. Reading back some of the transcripts made it sound like I was on the verge of a psychotic break though.

    • emmelaich 21 hours ago ago

      I'm sure that Gilfoyle's prank in Silicon Valley had its basis in a true story, like most of SV's episodes. i.e. I'm sure this has been done quite a few times ever since Mark V. Shaney.

  • jcynix a day ago ago

    Ah, yes, Markov chains. A long time ago, Mark V. Shaney (https://en.wikipedia.org/wiki/Mark_V._Shaney) was designed by Rob Pike and posted on Usenet.

    And Veritasium's video "The Strange Math That Predicts (Almost) Anything" talks in detail about the history of Markov chains: https://youtu.be/KZeIEiBrT_w

    • emmelaich 21 hours ago ago

      Going way back, I believe https://en.wikipedia.org/wiki/Racter (1984) used Markov chains. The book "The Policeman's Beard is Half-Constructed" is hilarious, but as noted it was co-written by the author.

    • chilipepperhott a day ago ago

      It's a great explainer. Definitely better than mine.

      • proprietario a day ago ago

        fwiw, your explanation really made markov chains click for me (unlike other markov chain explanations). i have to admit that the part where you went from the idea over hashmaps to the matrix representation was a bit dense, but rereading it twice made me understand it, so thanks for your explanation!

        • chilipepperhott a day ago ago

          That's great to hear. Thanks! I'm glad it worked.

  • ahmedhawas123 a day ago ago

    Random tidbit - 15+ years ago, Markov chains were the go-to for auto-generating text. Google was not as advanced as it is today at flagging spam, so search engine results pages for highly affiliate-marketing-dense topics (e.g., certain medications and products) were swamped with Markov chain-created websites injected with certain keywords.

    • gattilorenz 14 hours ago ago

      For auto-generating nonsensical texts and SEO spam, yes; for actual text generation, planners and realizers were the go-to approach (and ~5 years later it was all LSTMs).

  • andai 12 hours ago ago

    This is most apparent with older models like GPT-2, and more generally base ("text") models like Davinci. They are essentially only trained to predict the next word, based on what they've seen on the internet.

    So they're a really fun way to try and "sample" the internet, by seeing the kind of outputs they produce.

    --

    A really fun thing I did with GPT-2 with a friend back when it came out, is we fine tuned the model on our WhatsApp conversation together.

    (The model was small enough to do this on mid range GPU, in a few hours.)

    So GPT-2 learned to simulate chats between us.

    It was hilarious, but also weirdly illuminating.

    My friend mostly talked about art (true), and didn't say much (true), and I rambled about how I finally figured out the One Weird Trick that's finally gonna turn my life around (...also true)...

    The text itself was largely incoherent, but the "feel" (statistical properties?) were spot on.

    • whycombinetor 10 hours ago ago

      Modern LLMs are also "essentially only trained to predict the next word, based on what they've seen on the internet."

      • andai 6 hours ago ago

        Well, sort of! The new ones all "sound like AI." They're not great for writing, especially in terms of style and tone.

        I'm not sure exactly how that works, but apparently instruct tuning produces mode collapse, whereas the older models are basically "raw unfiltered internet" (for better and worse).

        I've been playing around with Davinci and it's remarkable how much better the writing is.

        You have to prompt it right, because all it does is continue the text. (E.g. if you ask it a question, it might respond with more questions.) But I think we really lost something valuable when we started neglecting the base/text models.

      • in-silico 5 hours ago ago

        Until they've undergone preference tuning and RL post-training (which is expected to start using more training compute than next-token-prediction pre-training).

  • glouwbug a day ago ago

    The Practice of Programming by Kernighan and Pike had a really elegant Markov generator:

    https://github.com/Heatwave/the-practice-of-programming/blob...

  • fluxusars a day ago ago

    Can't believe no one's mentioned this classic yet: https://www.tumblr.com/kingjamesprogramming

    Back in the early days of Slack I wrote a chat bot that logged all the messages people wrote and used that data to make it say random things prompted by certain trigger words. It was hilarious, until the bosses found out that I was basically logging all their conversations. Unsurprisingly they made me turn it off.

    • airstrike a day ago ago

      LMAO thanks for that link, it's hilarious

  • taolson a day ago ago

    If you program a Markov chain to generate based upon a fairly short sequence length (4-5 characters), it can create some neat portmanteaus. I remember back in the early 90's I trained one on some typical tech literature and it came up with the word "marketecture".
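
    Something like this character-level sketch (toy corpus and parameters of my own choosing; no guarantee it coins "marketecture" specifically):

        import random

        corpus = "marketing architecture infrastructure market tech texture"
        k = 4  # the state is the last 4 characters

        model = {}
        for i in range(len(corpus) - k):
            model.setdefault(corpus[i:i + k], []).append(corpus[i + k])

        out = "mark"
        for _ in range(30):
            choices = model.get(out[-k:])
            if not choices:
                break
            out += random.choice(choices)
        print(out)  # words sharing a 4-character context can blend into portmanteaus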

    • ronsor a day ago ago

      That sounds like a corporate buzzword generator

  • mwcz a day ago ago

    This is great. I use Markov chains to describe to normies how LLMs work.

    It's very interesting reading what Claude Shannon wrote after producing Markov chains from English sentences.

    > In an unpublished spoof written a year later, Shannon imagined the damage his methods would do if they fell into the wrong hands. It seems that an evil Nazi scientist, dr. Hagen Krankheit, had escaped Germany with a prototype of his Müllabfuhrwortmaschine, a fearsome weapon of war "anticipated in the work ... of Dr. Claude Shannon." Krankheit's machine used the principles of randomized text to totally automate the propaganda industry. By randomly stitching together agitprop phrases in a way that approximated human language, the Müllabfuhrwortmaschine could produce an endless flood of demoralizing statements.

    From A Mind at Play. I don't remember the exact year he did that work. I think it was some time before 1950, yet strangely timely in 2025.

    • lou1306 15 hours ago ago

      This is amazing, in no small part due to the fact that Krankheit literally means "disease" in German.

  • dugmartin a day ago ago

    If you would like to visually play with a Markov chain model, my non-profit research org built a Markov plugin for CODAP, our data analysis tool:

    https://codap.concord.org/app/static/dg/en/cert/index.html?d...

    If you select drawing mode from the first screen shown and then click the "T" icon you can type some text and it will generate a state diagram for you that you can then "play" and examine the output sequences. If you have states that have multiple exit routes you can click on that state and adjust the probabilities of each option.

  • allthatineed a day ago ago

    I remember playing with megahal eggdrop bots.

    This was one of my first forays into modifying C code, trying to figure out why 350MB seemed to be the biggest brain size (32-bit memory limits and requiring a contiguous block for the entire brain).

    I miss the innocence of those days. Just being a teen, tinkering with things i didn't understand.

    • vunderba a day ago ago

      I remember reading the source of the original MegaHAL program when I was younger - one of the tricks that made it stand out (particularly in the Loebner competitions [1]) was that it used both a backwards and forwards Markov chain to generate responses.

      [1] https://en.wikipedia.org/wiki/Loebner_Prize

    • foobarian a day ago ago

      I'm old now, but thanks to LLMs I can now again tinker with things I don't understand :-)

      • jcynix a day ago ago

        The nice thing about LLMs is that they can explain stuff so you can learn to understand. And they are very patient.

        For example I'm currently relearning various ImageMagick details and thanks to their explanations now understand things that I cut/copy/pasted a long time ago without always understanding why things worked the way they did.

      • codr7 a day ago ago

        Are you though? Or is the LLM the target of your tinkering and lack of understanding? Honest question.

    • anthk a day ago ago

      Hailo is still alive on CPAN.

  • benob a day ago ago

    Like it or not, LLMs are effectively high-order Markov chains

    • BenoitEssiambre a day ago ago

      Exactly. I think of them as Markov Chains in grammar space or in Abstract Syntax Tree space instead of n-gram chain-of-words space. The attention mechanism likely plays a role in identifying the parent in the grammar tree or identifying other types of back references like pronouns or if it's for programming languages, variable back references.

    • krackers a day ago ago

      Only in the way that every real-world computer is a finite state machine.

      • bigfishrunning a day ago ago

        Every real-world computer is a finite state machine, there's just a lot of states. I guess I'm unsure of the point that you're trying to make here

        • red75prime 16 hours ago ago

          The point? It's an almost useless way of looking at things.

    • guluarte a day ago ago

      markov chains with limited self correction

  • robotnikman a day ago ago

    I remember creating a chatbot for my minecraft server over a decade ago that used Markov chains. It was funny seeing the crazy things it spat out and the other players enjoyed it.

  • bjornsing 14 hours ago ago

    I’d say an LLM is a form of Markov chain, where the state is defined to consist of the entire context window, and each state transition adds one token to that context window.

  • lormayna 10 hours ago ago

    Almost 10 years ago I created a couple of Twitter bots that could tweet like the leaders of two populist Italian parties. The quality of the tweets was sometimes not the best, but usually decent. It was impressive how many people started following them within a few weeks.

  • Atatator 5 hours ago ago

    As my scientific advisor used to say: "Markov has nothing to lose but his chains!"

  • fair_enough a day ago ago

    On a humorous note OP, you seem like exactly the kind of guy who would get a kick out of this postmodern essay generator that a STEM professor wrote using a Recursive Transition Network in 1996:

    https://www.elsewhere.org/journal/pomo/

    Every now and again, I come back to it for a good chuckle. Here's what I got this time (link to the full essay below the excerpt):

    "If one examines subcultural capitalist theory, one is faced with a choice: either reject subdialectic cultural theory or conclude that the significance of the artist is significant form, but only if culture is interchangeable with art. It could be said that many desituationisms concerning the capitalist paradigm of context exist. The subject is contextualised into a presemanticist deappropriation that includes truth as a totality."

    https://www.elsewhere.org/journal/pomo/?1298795365

    • 1718627440 a day ago ago

      I think that's how source code sounds to non-programmers. Too far removed from the meaning to distinguish what makes sense and what is garbage.

    • chilipepperhott a day ago ago

      That essay generator is pretty cool. There are definitely a lot of undergrads whose essays look about the same.

      • fair_enough a day ago ago

        There is no dearth of midwittery in the non-STEM fields. I admit this is elitist and downright snobbish, but I am disgusted that somebody else can get a degree from the same university as me in a non-STEM major and piggyback off of prestige they didn't create.

        In my opinion, it's stolen valor.

        • procaryote 16 hours ago ago

          Perhaps you should value knowledge over status?

          • fair_enough 8 hours ago ago

            It doesn't matter what I value. The people who wanna hire me for jobs or invest in my fund value status.

            There's a reason companies hire people they don't need: to make themselves look bigger to investors and in the case of law firms, clients.

            Analyzing complex systems from first principles and challenging decisions that have no good reason is a noble cause, but it's only useful if you have the authority to make the changes. I'm not dictator of the world, I don't get to dictate how customers, clients, or employers shall make their business and hiring decisions.

  • kragen a day ago ago

    A first-order Markov-chain text generator is not complex at all; it's a one-line shell script, reformatted here onto three lines for clarity:

        perl -ane 'push@{$m{$a}},$a=$_ for@F;END{
          print$a=$m{$a}[rand@{$m{$a}}]," "for 1..100_000
        }' <<<'fish fish for fish.'
    
    Given the King James Bible instead of "fish fish for fish." as input, this program produces output like this:

    25:13 As the hands upon his disciple; but make great matter wisely in their calamity; 1:14 Sanctify unto the arches thereof unto them: 28:14 Happy is our judges ruled, that believe. 19:36 Thus saith the first being one another's feet.

    This leans pretty hard on Perl DWIM; a Python version is many times longer:

        import random, sys
    
        model, last, word = {None: []}, None, None
        for word in (word for line in sys.stdin for word in line.split()):
            model.setdefault(last, []).append(last := word)
    
        sys.stdout.writelines(str(word := random.choice(model.get(word) or words)) + " "
                              for words in [list(model.keys())] for i in range(100_000))
    
    (If you write code like that at a job, you might get fired. I would reject your pull request.)

    It's probably worth mentioning:

    - Markov-chain states don't have to be words. For text modeling, a common thing to do is to use N-grams (of either words or characters) instead of single words for your states.

    - It's good to have some nonzero probability of taking a transition that hasn't occurred in the training set; this is especially important if you use a larger universe of states, because each state will occur fewer times in the training set. "Laplacian smoothing" is one way to do this (see the sketch after this list).

    - Usually (and in my code examples above) the probability distribution of transitions we take in our text generator is the probability distribution we inferred from the input. Potter's approach of multiplying his next-state-occupancy vector Ms⃗ by a random diagonal matrix R and then taking the index of the highest value in the resulting vector produces somewhat similar results, but they are not the same; for example

        sum(random.random() * 0.75 > random.random() * 0.25 for i in range(1_000_000))
    
    is roughly 833000, not roughly 750000. It's 5× as common instead of 3×. But I imagine that it still produces perfectly cromulent text.
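
    On the smoothing bullet above, a minimal sketch of add-one (Laplace) smoothing for a single state's transition distribution:

        vocab = ["fish", "for", "fish.", "for."]
        counts = {"fish": 2, "for": 1}    # observed transitions out of some state

        # Every word gets one pseudo-count, so transitions never seen in training
        # keep a small nonzero probability instead of being impossible.
        total = sum(counts.values()) + len(vocab)
        probs = {w: (counts.get(w, 0) + 1) / total for w in vocab}
        # {'fish': 3/7, 'for': 2/7, 'fish.': 1/7, 'for.': 1/7}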

    Interestingly, modern compilers derive from Chomsky's theory of linguistics, which he formulated while working on an AI project in the 01950s in order to demolish the dominant theory of psychology at the time, behaviorism. Behaviorism essentially held that human minds were Markov chains, and Chomsky showed that Markov chains couldn't produce context-free languages.

    • chilipepperhott a day ago ago

      This is actually sick! I have to admit, my implementation is far more (likely unnecessarily) complex.

      • kragen a day ago ago

        Thanks! I think! Your implementation might be easier to understand and modify, too, although perhaps not because of being unnecessarily complex.

    • 1718627440 a day ago ago

      What????????? Given that, how was it not obvious for decades that this devolves into something like the current LLMs? I gave it only the first few paragraphs of a random man page, and it already passes for a human trying to form nonsense sentences, with mostly (but not always) correct grammar. LLMs now feel not very technically advanced. The code isolation in their interface seems to be more impressive than this.

      • kragen a day ago ago

        If you only give it a few paragraphs, most words only occur once, so it ends up copying whole phrases from the input until it hits a common word. Computers are very good at sounding like human writing when they're just reproducing things that humans in fact did write. LLMs mostly aren't doing that, though.

        • 1718627440 a day ago ago

          You're right, I was a bit too enthusiastic. I've fed it random C code (both normal and already preprocessed) and it produces random garbage, not remotely passable as valid syntax. Also, it seems to really like emitting parts of copyright messages, since they occur verbatim everywhere.

          Happy to be part of today's 10,000.

      • imtringued 9 hours ago ago

        If there was an array programming language with automatic differentiation and a focus on writing neural networks, I'm sure you could write a transformer based LLM including how to train it on a napkin.

        • kragen 9 hours ago ago

          I think reverse-mode automatic differentiation is significantly harder to implement than a string-indexed hash table and delimiter splitting, but maybe that's not important when those are such a small part of an interpreter for Perl or even Awk?

          How big is a Transformer in TensorFlow in Python?

          • kragen 5 hours ago ago

            (I guess I should have pointed out that forward-mode automatic differentiation was only about 100–150 lines of code when I implemented it in http://canonical.org/~kragen/sw/81hacks/autodiffgraph/, depending on where you draw the lines. But gradient descent isn't practical with forward-mode autodiff unless it's with a very small number of independent variables. Also even 100–150 lines of code is significantly bigger than a hash table and delimiter splitting.)

    • 3abiton a day ago ago

      I brushed up against MCMC decades ago, and your code is truly beautiful.

      • kragen a day ago ago

        MCMC is considerably more complicated, though!

  • cestith a day ago ago

    I’ve been telling people for years that a reasonably workable initial, simplified mental model of a large language model is a Markov chain generator with an unlimited, weighted corpus trained in. Very few people who know LLMs have said anything to critique that thought beyond that it’s a coarse description and downplays the details. Since "simplified" is in the initial statement and it’s not meant to capture detail, I say: if it walks like a really big duck and it honks instead of quacking, then it’s maybe a goose or a swan, which are both pretty duck-like birds.

    • nerdponx a day ago ago

      It's not a Markov chain because it doesn't obey the Markov property.

      What it is, and what I assume you mean, is a next-word prediction model based solely on the previous sequence of words, up to some limit. It literally is that, because it was designed to be that.

      • qwelfkh a day ago ago

        Okay, so we have a function that converts a buffer of tokens to a distribution of next tokens. We give the function a buffer, get back the next-token-distribution, from which we sample to get our next token. We append that to the buffer (and possibly roll off the first token) to obtain a new buffer. If we consider entire buffers to be states (which is a perfectly reasonable thing to do), then we have a stochastic process that moves us from one state to another. How is that not a Markov chain?

        • nerdponx a day ago ago

          That's fair, and yes it is a Markov chain if you frame it that way. Albeit with the interesting property that the state is very very high dimensional, and the current and previous states never differ by more than 1 token.

      • cestith a day ago ago

        What about a language model requires knowledge of previous states to choose the next state based on the current state? Are you asserting that only the last word chosen is the current state? Couldn’t the list of symbols chosen so far be the current state, so long as no data is carried forward about how that list has been generated?

        • coliveira a day ago ago

          You're right, the recent memory of the LLM is the state, that's why it needs to be so deep to be effective compared to a traditional MC.

    • hatthew a day ago ago

      Anything can be a markov chain if you define "memory of the past" as a component of the current state and include a big enough memory to hold all of the past for any practical amount of time. Personally, I feel like in most contexts it's not helpful to think of LLMs that way, since their state space (context length) is so large that they are extremely far removed from what we normally think of as markov chains. One might as well define humans as markov chains, since the brain has a finite memory.

      • cestith 7 hours ago ago

        Memory of the past implies the current state includes information about how the buffer got the way it is. I’m not defining anything in that way. I’m saying that if you have a buffer of N tokens, the next token N+1 can be chosen based on N tokens, not 1 token and not on the calculations or past state changes that went into creating the current state of the buffer.

    • rytill a day ago ago

      So, I have heard a number of people say this, and I feel like I'm the person in your conversations saying it's a coarse description and downplays the details. What I don't understand is, what specifically do we gain from thinking of it as a Markov chain.

      Like, what is one insight beyond that LLMs are Markov chains that you've derived from thinking of LLMs as Markov chains? I'm genuinely very curious.

      • jltsiren a day ago ago

        It depends on whether you already had experience using large Markov models for practical purposes.

        Around 2009, I had developed an algorithm for building the Burrows–Wheeler transform at (what was back then) very large scale. If you have the BWT of a text corpus, you can use it to simulate a Markov model with any context length. I tried that with a Wikipedia dump, which was amusing for a while but not interesting enough to develop further.

        Then, around 2019, I was working in genomics. We were using pangenomes based on thousands of (human) haplotypes as reference genomes. The problem was that adding more haplotypes also added rare variants and rare combinations of variants, which could be misleading and eventually started decreasing accuracy in the tasks we were interested in. The standard practice was dropping variants that were too rare (e.g. <1%) in the population. I got better results with synthetic haplotypes generated by downsampling the true haplotypes with a Markov model (using the BWT-based approach). The distribution of local haplotypes within each context window was similar to the full set of haplotypes, but the noise from rare combinations of variants was mostly gone.

        Other people were doing haplotype inference with Markov models based on similarly large sets of haplotypes. If you knew, for a suitably large subset of variants, whether each variant was likely absent, heterozygous, or homozygous in the sequenced genome, you could use the model to get a good approximation of the genome.

        When ChatGPT appeared, the application was surprising (even though I knew some people who had been experimenting with GPT-2 and GPT-3). But it was less surprising on a technical level, as it was close enough to what I had intuitively considered possible with large Markov models.

    • jama211 a day ago ago

      Sure, but arguably by that definition so are we ;)

      • lackoftactics a day ago ago

        There is no free will, only determinism

    • imtringued 9 hours ago ago

      This is pretty stupid, because the equation for self-attention is trivial enough that it makes sense to just go straight into the math and not play pretend games.

      softmax(QK^T/sqrt(dk))V

      Q = W_Q X, K = W_K X, V = W_V X

      X is a matrix where every column is the embedding of a token.

      K and V matrices need to be cached.

      Also, you need to calculate all of it, for every layer, even if you just want a single token of output. The tokens after the first one can reuse the cached K, V matrices and don't have to recalculate everything from scratch. Causal attention is the concept of masking the calculation so that you only have to look at past tokens.

      My critique boils down to this: Transformers are too simple to simplify them. You can only lose information if you do that.
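
      For anyone who wants that math runnable, here is a minimal numpy sketch of one causal self-attention head, following the column-vector convention above (random weights, purely illustrative):

          import numpy as np

          d, n = 4, 3                             # embedding dim, number of tokens
          X = np.random.randn(d, n)               # each column is a token embedding
          W_Q, W_K, W_V = (np.random.randn(d, d) for _ in range(3))

          Q, K, V = W_Q @ X, W_K @ X, W_V @ X
          scores = Q.T @ K / np.sqrt(d)           # (n, n): row i = query i vs. all keys

          # Causal mask: token i may only attend to tokens j <= i.
          scores[np.triu_indices(n, k=1)] = -np.inf

          weights = np.exp(scores)
          weights /= weights.sum(axis=1, keepdims=True)   # softmax over keys, per query
          out = V @ weights.T                     # column i: attention output for token i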

    • Uehreka a day ago ago

      The problem is that it doesn’t stay simplified; once you leave the room those people start using that model to explain details of actual LLM behavior. I’m constantly arguing with people who use this mental model to explain why LLMs can’t do things that they absolutely can do.

      I frequently see people underestimate the danger of LLMs to humanity in the long term and to people’s jobs in the medium term because they follow this chain of reasoning:

      - An LLM is basically a fancy markov chain (highly dubious even if there’s a tiny kernel of truth in there)

      - Therefore markov chains and LLMs have a similar skill ceiling (definitely wrong)

      - My job could never be done by a markov chain, it’s much too complex (this is presented self-evidently, no one ever feels the need to back this up)

      - Therefore, since LLMs are basically markov chains, and a markov chain could never do my job, I have nothing to worry about.

      I’ll grant that you’re not necessarily responsible for these people using your model in a way you didn’t intend. But I do think at this point it’s time to start pushing people to stop trying to reason mechanically about how these things work.

      • asdlfaglkj a day ago ago

        1. An LLM is absolutely a Markov chain. That is not meant to dismiss that they are a remarkable feat of generating and compressing the representation of really cool Markov chains.

        2. Because they are Markov chains, their skill ceiling is bounded by whatever the skill ceiling of Markov chains happens to be. What LLMs demonstrate is that the skill ceiling of Markov chains is higher than previously understood.

        3. If the skill ceiling of Markov chains is high enough, then a Markov chain could take over some or all of someone's job.

        • drdeca 18 hours ago ago

          I think there's an equivocation being accidentally made between n-gram models and Markov processes, because “Markov chain” is used to mean both things.

          N-gram models are not useful in many of the ways LLMs are.

          N-gram models are very limited.

          On the other hand, basically any process can be considered a Markov process if you make the state include enough.

          So calling both the very limited n-gram models and the nearly unlimited Markov processes by the same name, “Markov chain”, is just super confusing.
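          The distinction is easy to see in code. An n-gram model is a Markov chain whose state is just the last n-1 tokens; a rough sketch (illustrative, not any particular library):

            import random
            from collections import Counter, defaultdict

            def train(tokens, n=3):
                # state = previous n-1 tokens; transitions = next-token counts
                chain = defaultdict(Counter)
                for i in range(len(tokens) - n + 1):
                    chain[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
                return chain

            def generate(chain, state, steps=20, rng=random.Random(0)):
                out = list(state)
                for _ in range(steps):
                    counts = chain.get(tuple(out[-len(state):]))
                    if not counts:
                        break
                    out.append(rng.choices(list(counts), list(counts.values()))[0])
                return out

          An LLM's context window plays the same “state” role, but its transition function generalizes across states instead of being a count table, which is exactly where the n-gram intuition breaks down.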

      • Skgqie1 12 hours ago ago

        > I frequently see people underestimate the danger of LLMs to humanity in the long term

        My experience with people making claims about the inherent danger has been along the lines of this later quote:

        > this is presented self-evidently, no one ever feels the need to back this up

  • Nevermark a day ago ago

    Nice little example.

    Could add a "stop" token at the end of the fruit example, so when the little model runs it can call its own stop.

    To generalize to ADHD, create another model that can quietly "Listen" to a speaker (a token generator) until its Markov chain predicts a "Stop", then start its own generating model running, regardless of whether the speaker was done.

    Then two instances can have time efficient interrupty conversations with each other.

  • procaryote 17 hours ago ago

    Aren't transformers Markov chains, mathematically? You couldn't practically implement them as naive Markov chains, since the state space is too big, but the math is the same, isn't it?

    • lou1306 15 hours ago ago

      Yes, in the sense that they can be formalised as such. Given n tokens t1..tn, a transformer computes, for every candidate token t', the probability p(t') of it being the next token. Each of these probabilities is an arc from the state t1..tn to the state t1..tn,t', weighted by p(t').

      And since the context window is bounded, this is guaranteed to be finite-state, and all states are transient except those where the window is saturated.
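      A schematic sketch of that construction (names are hypothetical; next_dist stands in for the transformer's entire forward pass):

        from typing import Callable, Dict, Tuple

        Token = str
        State = Tuple[Token, ...]  # the context window, at most `window` tokens

        def transitions(state: State,
                        next_dist: Callable[[State], Dict[Token, float]],
                        window: int) -> Dict[State, float]:
            # one Markov step: move to state + (t',) with probability p(t'),
            # truncating to the window so the state space stays finite
            return {(state + (t,))[-window:]: p
                    for t, p in next_dist(state).items()}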

  • Aperocky a day ago ago

    Shameless plug for playing with random markov chain with seed texts: https://aperocky.com/markov/

    Source: https://github.com/Aperocky/weighted-markov-generator

  • traes a day ago ago

    There seems to be a mistake in the "Building the Transition Matrix" section of this article. Instead of showing the diagonal matrix D with the normalization coefficients, it shows the normalized Markov matrix M, incorrectly claiming it is equal to the unnormalized matrix C.
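    For reference, the usual construction, in the comment's notation (assuming the article intends a row-stochastic M):

      import numpy as np

      C = np.array([[2., 1.],
                    [0., 3.]])        # unnormalized transition counts
      D = np.diag(C.sum(axis=1))      # diagonal matrix of row sums
      M = np.linalg.inv(D) @ C        # M = D^-1 C; each row sums to 1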

  • Doch88 a day ago ago

    I remember coding IRC bots with Markov chains that would repeat random nonsense from the chat. So many memories.

  • saylisteins a day ago ago

    There is an interesting video about this from Veritasium

    https://www.youtube.com/watch?v=KZeIEiBrT_w

  • mangomountain a day ago ago

    When I first heard about ChatGPT I thought it was a Markov chain generator and totally dismissed it, lol, whoops.

  • jedberg a day ago ago

    For funsies one of the early reddit engineers wrote a Markov comment generator trained on all the existing comments.

    It worked surprisingly well. :)

  • stared 16 hours ago ago

    Markov models do not extrapolate. If we take intuitions from HMMs and use them in LLMs, we are prone to see nothing besides "stochastic parrots".

    Anyway, for anyone who wants to go deeper, I recommend Lawrence R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", https://www.cs.cmu.edu/~cga/behavior/rabiner1.pdf

    And for a cute example, there is also one in none other than Claude Shannon's "A Mathematical Theory of Communication": https://web.archive.org/web/20121108191018/https://www.alcat...

  • calf a day ago ago

    So I think this "talking point" of "LLMs are just Markov chains" is as bad, analogously, as "Your amazing invention that passes the Turing Test is justa giant LUT" or "Your C++ program is justa FSM". It is all technically true, but it is being used subtly to dismiss/delegitimize something, instead of to communicate a scientifically insightful idea. How is this different from "Humans are just collections of cells"? True, except we are a very interesting type of collection! LLMs are a very interesting type of Markov chain! Haskell/Java/Ruby programs are a very interesting type of FSM! Etc. The talking point by itself is just reductionism, of which we've seen and heard variants during other scientific revolutions.

    Also, this Markov chain has size O(T^K), where T is the vocabulary size and K the token length of the context window, which makes it a beyond astronomically large object: a 50,000-token vocabulary and a 2,048-token window give on the order of 50000^2048 states.

  • Spooky_Fusion1 a day ago ago

    Great article. I have been hitting on high-school levels including 8-dimensional, dual quaternion, Dirac's Eq. algebra on my website, mainly as they give a view of the dynamics of non-trivial unitary matrix expansions that can occur in such as zeta zeros, element nuclei, prime numbers, energy levels for light elements. For delocalized electrons in sync, the Markov chains are interesting, as they relate to cellular automata, the taking of limits for (relativity, electrons moving with charge potential). Having not done high-school algebra until later years, I have learnt not to assume a particular algebra fits all dynamic non-trivial matrix processes, i.e. quantum computing, long-range sensors, gravity waves, pressurised plasma at phase, etc. So, great article, as it shows a practical high-school level of applied knowledge that doesn't revolve solely around unitary matrices, in as much as its potential dynamic (linear expansion) evolved states, to which the high-school level of visualisation can become aware! (I just thought I'd add this on legal probabilities. There is a case involving an SASR trooper 'Schultz' that revolves around shooting a Mullah's spotter. In Australia, magistrates seem to judge cases based on 4-dimensional probabilities. And this seems to tie in with Jeffrey Epstein's management through cohesion of President Clinton's patent attorney change from a solely engineer-based to a lawyer-based patent granting. Epstein's little-girl problem wasn't solely sexual, but given stock market prediction dynamics, more like a dynamic non-trivial unitary probability problem of visualising size. Something magistrates in Australia seem to have adopted in determining complex military cases.) Warren

  • slashdave a day ago ago

    I don't think the author understands what a Markov chain is

    • jaapz 15 hours ago ago

      Be the first to explain it to them properly then

  • Ianjit 16 hours ago ago

    I am at stage 4 as well. Boredom.

  • leecming a day ago ago

    Markov chains are a historical curio for NLP: easy to understand and implement, but just not good enough to do anything useful with. The jump to word vectors (e.g., word2vec) plus CNNs/RNNs arguably put the field in a place where you could build useful language tools.

    • hiddencost a day ago ago

      They're used heavily in information retrieval at scale, 3-gram counting features especially.

  • westurner 10 hours ago ago

    Markov chain > History: https://en.wikipedia.org/wiki/Markov_chain#History

    Examples of Markov chains: https://en.wikipedia.org/wiki/Examples_of_Markov_chains

    A Hopfield network is a recurrent neural network (RNN).

    Hopfield network: https://en.wikipedia.org/wiki/Hopfield_network

    From "Ask HN: Parameter-free neural network models: Limits, Challenges, Opportunities?" (2024) https://news.ycombinator.com/item?id=41794272 re: neural network topologies:

    > The Asimov Institute > The Neural Network Zoo: https://www.asimovinstitute.org/neural-network-zoo

    > PNG: "A mostly complete chart of neural networks" (2019) includes Hopfield nets! https://www.asimovinstitute.org/wp-content/uploads/2019/04/N...

    > [ Category:Neural network architectures, Types of artificial neural networks ]

  • pessimizer a day ago ago

    Loved playing with Babble! when I was a teenager: https://archive.org/details/Babble_1020

    You could load it up with anything, and it would just babble on.

  • brador 14 hours ago ago

    RNG, then Bayesian models, then vectors, then Markov chains, then tensors: that is the temporal research order of human-like sentence generation.

    Exciting to see what's next, but not much breakthrough remains in the space.

  • anthk a day ago ago

    Also, for Perl users:

      curl -L https://cpanmin.us/ -o cpanm
      chmod +x cpanm

      ./cpanm -n local::lib
      ./cpanm -n Hailo

      # train a brain from a corpus, then chat with it
      ~/perl5/bin/hailo -t file.txt -b file.brn
      ~/perl5/bin/hailo -b file.brn

    file.txt must be a text file with one sentence per line.