Support for the TSO memory model on Arm CPUs (2024)

(lwn.net)

32 points | by weinzierl 19 hours ago

22 comments

  • slabtickler 14 hours ago

    imo the fragmentation argument is very flaky. this is not like, e.g., the old ARM big-endian mode, which accidentally created a separate ISA for really dubious reasons. the only things that really need alternate store orders on ARM at this point are (x86) emulators. This argument may have made more sense 10 or 15 years ago, but ARM has become so ubiquitous that the idea of “oh, my port of this originally x86 userland software crashes on ARM due to a synchronization bug, better use this TSO thing as a get-out-of-jail-free card” really doesn’t hold much water now

    • gpderetta 6 hours ago

      There is still interest in emulating x86 binaries on ARM. See the efforts from Valve, for example.

  • rbanffy 18 hours ago

    We need a TLA authority to help prevent collisions in the acronym space. It’s enough that MCP is also the Burroughs/Unisys mainframe operating system, now TSO is also the time-sharing option on IBM mainframes.

  • dmitrygr 18 hours ago

    The focus on user space fragmentation is wrong, IMHO.

    One of the maintainers (Catalin Marinas) made [0] a much more important point: Apple makes no promises about how their "TSO" bits work now or will work in the future. This mode was designed for Rosetta2, not the general public. It is not documented formally. Someone saying "it is TSO" is not documentation. A formal definition of a memory model is usually a very long document describing a lot of corner cases, for example [1] is a SUMMARY of the ARMv8 memory model, it is 31 pages long. It is a summary! The full spec makes up chapters D7 and D8 in [2], totaling 243 pages. Even there, there are corners that it does not touch on and people get wrong. Without such a spec for Apple's TSO mode, how can anyone rely on how it might or might not work?

    Additionally, you might find silicon bugs if you do something in this mode that Rosetta2 doesn't or didn't. Consider that the only first-party user of this mode was Rosetta2. Anything it does not do that you do might find a bug.

    The stated Linux kernel policy of "do not break user space" is impossible to deliver on if it is built on an undocumented hardware feature that might change at any time and was never fully publicly specified. The maintainers are right to reject this.

    [0] https://lwn.net/ml/linux-kernel/ZiKyWGKTw6Aqntod@arm.com/

    [1] https://developer.arm.com/-/media/Arm%20Developer%20Communit...

    [2] https://documentation-service.arm.com/static/6943ef0c7982093...

    • spijdar 15 hours ago

      I'd posit that Linux having a hard policy of "do not break user space" has always just been "mostly true", especially around architecture specific stuff, and stuff tied to corporations.

      I bring this up only because recently, I made it my mission to get IBM's "PowerVM Lx86" dynamic translation system running on a POWER9 system running modern Debian Linux.

      This required a lot of hackery I won't go into here, but it revealed two things to me:

      1. The lx86 system depended on an undocumented (??) pseudo-syscall to switch between big and little endian modes. This was "syscall 0x1EBE", which was implemented in the exception handler for PowerPC. In other words, it wasn't a real syscall, and tools like strace do NOT capture/log it. It was a conditional branch in the assembly entry point for the process exception handler which switches endianness and then returns. Quicker than a "real" syscall. Also, long gone in the Linux kernel, replaced with syscall switch_endian, hex 0x16B. Adding this in wasn't too hard, but it'd sure as heck never make it upstream again ;)

      2. A lot of other Linux calls have had bits added which break old applications that are "rigidly coded". For example, netlink has had extra data stuffed into various structures, which code like lx86's translator panics on. To get network stuff running, even with an antique x86 userland, required removing a bunch of stuff from the kernel. Not disabling it via configuration, but deleting code.

      All this to say, there is a precedent for breaking the Linux user facing ABI for hardware features like this. I'm not saying that's a good thing, but it is a thing.

    • saagarjha 15 hours ago
    • GeekyBear 18 hours ago

      As mentioned in the article, TSO is not exclusive to Apple's ARM implementation.

      > Some NVIDIA and Fujitsu CPUs run with TSO at all times; Apple's CPUs provide it as an optional feature that can be enabled at run time.

      • dmitrygr 17 hours ago

        > As mentioned in the article, TSO is not exclusive to Apple's ARM implementation.

        I thought I had been quite clear. I guess I'll try again even more clearly.

        "TSO" is three letters. It is not a spec. "We all do TSO" is as meaningful as "we all want world peace". Everyone has their own meaning for those words, and the meanings may differ significantly. Each is a memory model, and each can be called "TSO". But just like not every "John Smith" is the same person, nor is everything called "TSO" the same. Does NVIDIA's TSO order ALL reads with respect to ALL writes? Does Apple's? What does x86 do in that case? What does a Fujitsu CPU do? "TSO" does not mean the same thing to everyone just like "world peace" does not. If, for example, NVIDIA came out and said "our TSO mode complies 100% with x86 memory model and will always continue to", then Fujitsu did the same, and then (LOL) Apple also publicly promised that, then and only then would your comment make sense. As it stands, four entities use the same acronym to each mean their own thing, and you are assuming absolute equality because the three letters match.

        Fun story: I know FOR A FACT the answer to my above question about ordering of all reads vs all writes is not the same for x86, Apple's TSO, NVIDIA's TSO, and Fujitsu's TSO. Do you? Do you know how? Do you know how the answers might change with time and hardware revisions, given that at least Apple made no promises as to how their undocumented TSO mode works today or will work tomorrow? Exactly...

        One cannot build a stable f{ea,u}ture on undocumented un[der]specified hardware features.

        • Dylan16807 17 hours ago

          > I know FOR A FACT the answer to my above question about ordering of all reads vs all writes is not the same for x86, Apple's TSO, NVIDIA's TSO, and Fujitsu's TSO.

          Well of course they differ. TSO says that some reorderings are banned and some are optional, and there's a million factors that go into deciding when those options are taken.

          > "TSO" is three letters. It is not a spec.

          It's a few rules that you can depend on. Are those rules not enough to build a program on top of? The simpler you make your rules, the less spec you need. On the other end of the spectrum, a dozen specialized memory barriers need a ton of explanation.

          • dmitrygr 17 hours ago

            >> "TSO" is three letters. It is not a spec.

            >It's a few rules that you can depend on.

            Until properly specified they are not "rules" but "hopes". Apple made no promises and provided no specs for their TSO mode. What makes you sure that that TSO bit on AppleM4pro acts the same as on AppleM1? That same "TSO" bit might mean yet a third thing on AppleM7megaMaxProEliteG2 in 2031. How do you know that an OS update that also updated iBoot on your Mac did not change some internal chip config MSR and now even on your AppleM4pro CPU whose TSO you understood, it acts differently due to this config bit change?

            • Dylan16807 17 hours ago

              I wasn't talking about Apple's promises, I was talking about the meaning of "TSO". If you know you have TSO, you have some rules you can depend on. What's an example of something you need beyond those rules, to write correctly concurrent code?

              • dmitrygr 17 hours ago

                > If you know you have TSO

                "If you know you have world peace"

                Sure, now define "total". Which accesses does that affect and which ones does it not? Is device memory included? PCIe memory? Are there ordering guarantees between mappings with different permissions?

                Then, define "store ordering". Does it affect loads in any way? Or simply just stores?

                • Dylan16807 17 hours ago

                  > Sure, now define "total". Which accesses does that affect and which ones does it not? Is device memory included? PCIe memory? Are there ordering guarantees between mappings with different permissions?

                  At a basic level TSO is a model for how cores interact and devices are weird, so I'd say those get to be unspecified.

                  And ideally you want a line saying if the instruction cache needs to be flushed for self-modifying code since that's kind of a violation if not specified but it's a forgivable one.

                  > Then, define "store ordering".

                  Sure, though I'm not promising my wording is perfect: In TSO, when stores complete they become visible to all other cores and all cores agree on the exact same list of completed stores.

                  > Does it affect loads in any way? Or simply just stores?

                  Depends on what you mean by "affects". Loads in one core might not see stores from another core that have not yet reached the global/total list.
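
                  A minimal sketch of that wording as a toy machine, assuming each core has a FIFO store buffer that drains into one shared memory and that loads run in program order. The names `outcomes`, `pairs`, and `fifo` are hypothetical and chosen only for illustration; this models no real CPU and says nothing about device memory, mixed permissions, or instruction caches. It enumerates every interleaving of two classic litmus tests, store buffering (`sb`) and message passing (`mp`):

```python
def outcomes(threads, fifo=True):
    """Enumerate all final register values of a toy store-buffer machine.

    threads: list of programs; each op is ("st", addr, value) or ("ld", addr, reg).
    fifo=True drains each core's buffer oldest-first (the TSO-like assumption).
    fifo=False lets a younger store to a different address drain first
    (a weaker, non-TSO machine, included only for contrast).
    """
    results, seen = set(), set()

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        # Finished: every program has run and every buffer has drained.
        if all(pc == len(p) for pc, p in zip(pcs, threads)) and not any(bufs):
            results.add(tuple(sorted(regs)))
            return
        for t, prog in enumerate(threads):
            # Choice 1: core t executes its next instruction.
            if pcs[t] < len(prog):
                op, addr, x = prog[pcs[t]]
                pcs2 = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op == "st":
                    # Stores go into the core's private buffer first.
                    bufs2 = bufs[:t] + (bufs[t] + ((addr, x),),) + bufs[t + 1:]
                    explore(pcs2, bufs2, mem, regs)
                else:
                    # Loads see the core's own newest buffered store, else memory.
                    pending = [v for a, v in bufs[t] if a == addr]
                    val = pending[-1] if pending else dict(mem).get(addr, 0)
                    explore(pcs2, bufs, mem, regs | {(x, val)})
            # Choice 2: one of core t's buffered stores reaches shared memory.
            candidates = range(min(1, len(bufs[t]))) if fifo else range(len(bufs[t]))
            for i in candidates:
                addr, val = bufs[t][i]
                if any(a == addr for a, _ in bufs[t][:i]):
                    continue  # never reorder two stores to the same address
                new_mem = dict(mem)
                new_mem[addr] = val
                bufs2 = bufs[:t] + (bufs[t][:i] + bufs[t][i + 1:],) + bufs[t + 1:]
                explore(pcs, bufs2, frozenset(new_mem.items()), regs)

    n = len(threads)
    explore((0,) * n, ((),) * n, frozenset(), frozenset())
    return results


def pairs(threads, fifo=True):
    """Return the sorted list of reachable (r0, r1) results."""
    return sorted((dict(o)["r0"], dict(o)["r1"]) for o in outcomes(threads, fifo))


# Store buffering: each core stores to one location, then loads the other.
sb = [[("st", "x", 1), ("ld", "y", "r0")],
      [("st", "y", 1), ("ld", "x", "r1")]]

# Message passing: core 0 writes data then flag; core 1 reads flag then data.
mp = [[("st", "data", 1), ("st", "flag", 1)],
      [("ld", "flag", "r0"), ("ld", "data", "r1")]]

print("SB, FIFO buffers:    ", pairs(sb))
print("MP, FIFO buffers:    ", pairs(mp))
print("MP, non-FIFO buffers:", pairs(mp, fifo=False))
```

                  Under these assumptions the store-buffering test reaches all four results, including (0, 0), where each load misses a store still sitting in the other core's buffer. Message passing never yields (1, 0) with FIFO buffers, because the flag cannot reach memory before the data; with `fifo=False` that result becomes reachable.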

                  • slabtickler 14 hours ago

                    just speaking honestly i would not consider I$ snooping as part of the definition of TSO. it is part of the x86 memory model yes but “TSO” does not define the full story here

                  • dmitrygr 16 hours ago

                    > when stores complete they become visible to all other cores and all cores agree on the exact same list of completed stores.

                    Not that they agree on what completed but on the order they completed in. That is the "o" in TSO. You inadvertently proved my point.

                    .

                    > so I'd say those get to be unspecified.

                    * CRASH *

                    You left something unspecified that mattered. Ordering of accesses to mappings with differing permissions matters, and whether they are seen in-program-order or not by other cores will break x86 emulators (main use cases for TSO).

                    .

                    That's the point here :) This is the usual "i am sure we can all agree what X means" argument - it does not work when it comes to precise things like memory models.

                    • Dylan16807 15 hours ago

                      > Not that they agree on what completed but on the order they completed in. That is the "o" in TSO. You inadvertently proved my point.

                      A list is ordered. You're trying too hard to nitpick. (Also I gave a disclaimer that my wording wasn't perfect, and it only took a couple words for you to "fix" it. If it can be fixed that easily then that doesn't actually counteract my point.)

                      > You left something unspecified that mattered. Ordering of accesses to mappings with differing permissions matters, and whether they are seen in-program-order or not by other cores will break x86 emulators (main use cases for TSO).

                      How many x86 emulators have the emulated code talking directly to hardware, to the same piece of hardware, from multiple cores at the same time?

                      I don't think this is a "main use case".

                      Plus there's going to be a baseline for how talking to the hardware works. Only TSO-mode-specific details of the hardware access are left unsaid in this basic model, and many access patterns fitting the above description still won't notice anything one way or the other.

                • gpderetta 6 hours ago

                  > define "store ordering". Does it affect loads in any way? Or simply just stores

                  It affects the visible ordering of remote stores to normal memory, so loads are necessarily affected (it wouldn't make sense to guarantee a store order if it were unobservable).

                  Really, TSO is defined independently of x86 and in fact it took a while to actually prove that x86 was TSO. Concretely, how do architectures that claim (optional) TSO differ from each other at least for access to normal memory?

        • GeekyBear 17 hours ago

          Were you aware that all the BIOS implementations used in PC compatible computers (Compaq, AMI, Phoenix, etc.) were not identical and were compatible to a greater or lesser extent with the original IBM BIOS, yet Linux somehow supported PC compatible computers?

          > Someone saying "it is TSO" is not documentation.

          Trying to re-implement what IBM's BIOS did was not documentation either.

          The original sets the standard, whether a given implementation is perfectly equivalent or not.

          • dmitrygr 17 hours ago

            I see no further point for this discussion. Either you truly do not understand or are pretending to not understand the difference between memory models (affect literally every memory access as long as the system is powered up) and BIOS (not used once the OS is up, and thus one-time at-boot quirks handling code can work around most issues). Either way, g'day.

            Oh, and to answer your question, yes, quite aware, actually. I've done quite a bit of low level work over the decades, including, curiously, working in the Apple platform kernel team at the time when this TSO bit appeared.

    • garaetjjte 15 hours ago

      Linux never had "must be publicly documented" as a general rule.

      • dmitrygr 15 hours ago

        That is fair. Reverse-engineered drivers are common. But depending on undocumented CPU core features seems a bit insane (as indeed the LKML post mentions)

  • brcmthrowaway 14 hours ago

    It's a pity that Hector Martin stepped away from all this great work (under that name, anyway)