Load-Store Conflicts

(zeux.io)

114 points | by ashvardanian 2 days ago

5 comments

  • Sesse__ 2 days ago

    I find Clang generally a bit too eager to combine loads. This is especially bad when returning structs through the stack; you typically write them piecemeal in some function, return, and then the caller often wants to copy it from the stack into somewhere else, which it does with SIMD loads/stores.

    This is a significant problem on AMD; Intel and Apple seem to be better.
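
    A minimal sketch of the pattern described above, for illustration only. The `Result` struct, `make_result`, and `fill` are hypothetical names, and the `noinline` attribute assumes GCC or Clang. The struct is 32 bytes so that the x86-64 System V ABI returns it through memory rather than in registers. Whether the copy really becomes a wide SIMD load that overlaps pending narrow stores depends on the compiler, version, and flags, so the generated assembly should be checked before drawing conclusions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-byte struct: large enough to be returned through memory. */
typedef struct {
    uint32_t a, b, c, d, e, f, g, h;
} Result;

/* Writes the return slot piecemeal: eight separate 4-byte stores.
   noinline (GCC/Clang attribute) keeps the call boundary visible. */
__attribute__((noinline))
static Result make_result(uint32_t seed) {
    Result r;
    r.a = seed;
    r.b = seed + 1;
    r.c = seed + 2;
    r.d = seed + 3;
    r.e = seed + 4;
    r.f = seed + 5;
    r.g = seed + 6;
    r.h = seed + 7;
    return r;
}

/* The caller copies each freshly written struct somewhere else.
   A compiler may lower the memcpy to 16- or 32-byte vector loads,
   each of which would span several of the narrower stores above. */
static void fill(Result *out, size_t n) {
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        Result tmp = make_result((uint32_t)i);
        memcpy(&out[i], &tmp, sizeof tmp);
    }
}
```

    Calling `fill` on a large array in a loop and comparing cycle counts against a variant that copies field by field is one way to see whether a given CPU pays a forwarding penalty here.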

    • cwzwarich 2 days ago

      > This is a significant problem on AMD; Intel and Apple seem to be better.

      When did this change? In my testing years ago (while I was writing Rosetta 2, so Icelake-era Intel), Intel only allowed a load to forward from a single store, and no partial forwarding (i.e. mixed cache/register) without a huge penalty, whereas AMD at least allowed partial forwarding (or had a considerably lower penalty than Intel).

      • Sesse__ 2 days ago

        I don't know if AMD allows more or fewer _situations_, but empirically, I'm seeing a lot of total cycles lost to this on Zen 2 and 3, and much less on the Intel CPUs I've been testing (mostly Skylake derivatives and Alder Lake).

        I haven't tested Zen 4 or 5, but I haven't heard anything that indicates they should be a lot better.

        • cwzwarich 2 days ago

          Interesting! IIRC, the LLVM passes dedicated to dodging this issue were contributed by Intel engineers, so maybe there’s some bias.

  • haberman 2 days ago

    A very interesting article that goes deeper into store-to-load forwarding than anything I’ve read before.