3 comments

  • mikewarot 8 hours ago

    I imagine coprocessors that don't have separate memory or instructions... they are effectively huge arrays of lookup tables, so the data flows through the instructions themselves. We're at the stage where this is possible for all but the biggest LLMs. (Rough sketch of the idea below.)

    A side effect of doing this mapping, even without the hardware, is that it makes a given task inherently parallel and much, much easier to spread across low-cost CPUs. I think of it as a universal solvent for computation.
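
    A minimal sketch of the idea in C (the toy 8-bit ops and table sizes are my own assumptions, not a real design): the "program" is a chain of precomputed tables, and the only work on the data path is indexed loads.

      #include <stdint.h>
      #include <stdio.h>

      static uint8_t square_lut[256];   /* stage 1: x -> x*x (mod 256) */
      static uint8_t negate_lut[256];   /* stage 2: x -> -x  (mod 256) */

      static void build_luts(void) {
          for (int x = 0; x < 256; x++) {
              square_lut[x] = (uint8_t)(x * x);
              negate_lut[x] = (uint8_t)(-x);
          }
      }

      int main(void) {
          build_luts();
          uint8_t x = 7;
          /* no arithmetic instructions touch the data: it just flows
             through the tables, one indexed load per "instruction" */
          uint8_t y = negate_lut[square_lut[x]];
          printf("%d\n", y);   /* 207, i.e. -(7*7) mod 256 */
          return 0;
      }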

  • GarvielLoken 13 hours ago

    This is actually what DOTS (Unity’s Data-Oriented Technology Stack) does in Unity, so a game engine is a very good example! It reportedly gives performance gains just as enormous as the ones you show in the article.

    https://unity.com/dots
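
    For anyone unfamiliar: the core of DOTS-style data-oriented design is laying entities out as struct-of-arrays so hot loops touch only the fields they need. A minimal sketch in C (the particle fields and update step are made up for illustration, not Unity's actual API):

      #include <stddef.h>

      #define N 100000

      /* Array-of-structs: updating positions drags vx/vy/health through
         cache along with x/y, wasting most of every cache line. */
      struct ParticleAoS { float x, y, vx, vy; int health; };

      /* Struct-of-arrays (the DOTS-style layout): each field is
         contiguous, so a position update streams exactly the bytes it
         needs, prefetch-friendly and easy to auto-vectorize. */
      struct ParticlesSoA {
          float x[N], y[N];
          float vx[N], vy[N];
          int   health[N];
      };

      void update_positions(struct ParticlesSoA *p, float dt) {
          for (size_t i = 0; i < N; i++) {
              p->x[i] += p->vx[i] * dt;
              p->y[i] += p->vy[i] * dt;
          }
      }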

  • LorenPechtel 9 hours ago

    Yup, memory access dominates an awful lot of things. We keep obsessing over more and faster cores, but if they're just waiting on memory that doesn't buy much. A while back I did an experiment with the Sieve of Eratosthenes and found that on modern systems the scattered memory access dominates: finding all primes up to a value X was much faster by brute force than with the Sieve. The brute-force approach runs entirely from the L1 cache; the only operations outside it are writes. With the Sieve, the access pattern is so scattered that the only cache hits come from prefetching.
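
    Roughly the two access patterns being compared, sketched in C (I'm assuming "brute force" means trial division against the primes found so far; the details of the original experiment may differ):

      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      /* Sieve: the inner loop writes at stride p, scattering across the
         whole array, so beyond L1/L2 size nearly every access misses. */
      size_t sieve_count(uint8_t *composite, size_t n) {  /* caller allocates n+1 bytes */
          size_t count = 0;
          memset(composite, 0, n + 1);
          for (size_t p = 2; p <= n; p++) {
              if (composite[p]) continue;
              count++;
              for (size_t m = p * p; m <= n; m += p)   /* scattered writes */
                  composite[m] = 1;
          }
          return count;
      }

      /* "Brute force": trial-divide each candidate by the primes found
         so far. The small prime list stays hot in L1; the only traffic
         leaving the cache is the occasional append. Assumes n < 2^32. */
      size_t brute_force_count(size_t n) {
          size_t cap = 4096, count = 0;
          uint32_t *primes = malloc(cap * sizeof *primes);
          for (size_t c = 2; c <= n; c++) {
              int is_prime = 1;
              for (size_t i = 0; i < count && (uint64_t)primes[i] * primes[i] <= c; i++)
                  if (c % primes[i] == 0) { is_prime = 0; break; }
              if (is_prime) {
                  if (count == cap) primes = realloc(primes, (cap *= 2) * sizeof *primes);
                  primes[count++] = (uint32_t)c;
              }
          }
          free(primes);
          return count;
      }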

    While this is obviously an extreme case, the reality is that you must consider the cost of precalculated data: you can do quite a few operations for the cost of reading one answer from a table that doesn't fit in cache. And there can be substantial benefits to iterating multi-dimensional arrays in the right order. It's a total flip from when I started out, when you weighed the memory cost of precalculating (is it worth the memory to build this table of square roots?), to now, when you weigh the time cost of looking up the answer (is it worth the memory fetch to look up that square root? Nope).
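
    To put that last trade-off in code, here's the square-root example as a C sketch (the table size is my assumption; the point is that once the table outgrows cache, the one dependent load costs more than just computing):

      #include <math.h>
      #include <stdlib.h>

      #define TABLE_SIZE (1 << 24)   /* 16M floats = 64 MB: nowhere near fitting in cache */

      float *build_sqrt_table(void) {   /* the old-school precalculation */
          float *t = malloc(TABLE_SIZE * sizeof *t);
          for (int i = 0; i < TABLE_SIZE; i++)
              t[i] = sqrtf((float)i);
          return t;
      }

      /* Then: pay memory for the table, get the answer with one load. */
      static inline float sqrt_lookup(const float *t, int x) { return t[x]; }

      /* Now: a hardware sqrt is a handful of cycles with no memory
         traffic, usually cheaper than the likely cache miss above. */
      static inline float sqrt_compute(int x) { return sqrtf((float)x); }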