The tectonics.js blog has some really incredible write-ups on how to do proper simulation of plate tectonics: https://davidson16807.github.io/tectonics.js/blog/news.html
As a hobbyist, shaders are up there as one of the most fun types of programming. It's a low-level, relatively simple language, often tied to a satisfying visual result. Once it clicks, it's a cool paradigm to be working in, e.g. "I am coding from the perspective of a single pixel".
I found them fun once they worked, but when something did not work, I did not enjoy debugging them so much.
Nothing like outputting specific colors to see which branch the current pixel is running through. It's like printf debugging, but colorful and with only three floats of output.
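A rough illustration of the color-as-printf trick, written as plain Python rather than real shader code so it runs anywhere. The `shade` function, its branch conditions, and the color constants are all made up for this sketch; the only point is that each branch returns its own flat color, so the image itself shows which path every pixel took.

```python
# Debug colors: one flat RGB triple per branch (all names are hypothetical).
RED, GREEN, BLUE = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)


def shade(u, v):
    """Toy per-pixel function; u and v are pixel coordinates in [0, 1]."""
    dx, dy = u - 0.5, v - 0.5
    if dx * dx + dy * dy < 0.1:  # branch A: inside a disc around the center
        return RED
    if v > 0.8:                  # branch B: band where v is large
        return GREEN
    return BLUE                  # branch C: everything else


def render(width, height):
    """Run shade() once per pixel center, the way a fragment shader would."""
    return [
        [shade((x + 0.5) / width, (y + 0.5) / height) for x in range(width)]
        for y in range(height)
    ]


def branch_histogram(image):
    """Count how many pixels took each branch, keyed by debug color."""
    counts = {}
    for row in image:
        for color in row:
            counts[color] = counts.get(color, 0) + 1
    return counts
```

On a GPU you would look at the rendered image instead of a histogram, but the idea is the same: three floats per pixel is all the output you get, so you spend them on whatever you need to see.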
I agree it’s very difficult to debug them. I sometimes rewrite my shaders in VEX and debug them there. It’s a shader language that runs on the CPU in Houdini. You can output a value at each pixel, which is useful for values outside the range of 0 to 1, or you can use printf(). I’m still looking for something that will transpile shaders into JavaScript.
This looks very ambitious; it's really starting from the basics, simulating tectonic plates.
Sadly, there never was a Part 2, was there?
I guess life just got in the way, as usual.
I wish I had an intuitive understanding of how much I can do with a GPU. E.g. how many points can I move around? A simulation like this would be great for that.
Here’s a datapoint: this project simulates ~100K rigid body blocks with full collision in 10 milliseconds (or roughly 10M blocks per second). https://graphics.cs.utah.edu/research/projects/avbd/
They mention it’s 3x faster when turning collision off. I don’t know what the memory footprint of a block is, but I’d speculate that small round particles (center plus radius) are an order of magnitude faster.
Modern GPUs are insanely fast. A higher-end consumer GPU like a 5090 can do over 100 teraflops of fp32 computation if your cache is perfectly utilized and memory access isn’t the bottleneck. Normally, memory is the bottleneck, and at a minimum you need to read and write your particles every frame of a sim, which is why the sibling comments are using memory bandwidth to estimate the number of particles per second. I’d guess that if you were only advecting particles without collision, or colliding against only a small number of big objects (like the particles collide against the planet and not each other), then you could move multiple billions of particles per second, which you would then divide by your desired frame rate to see how many particles per frame you can do.
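To make the "divide by your frame rate" step concrete, here is a back-of-the-envelope sketch. It assumes memory bandwidth is the only limit, that each particle is read once and written once per step, and it borrows two ballpark figures quoted elsewhere in the thread: roughly 1000 GB/s of global memory bandwidth and 24 bytes per particle (six float32 values). All of these are assumptions, not measurements.

```python
def particles_per_second(bandwidth_bytes_per_s, bytes_per_particle, passes=2):
    """Upper bound on particle updates per second if memory traffic is the
    only cost. passes=2 means one read plus one write per particle."""
    return bandwidth_bytes_per_s / (bytes_per_particle * passes)


def particles_per_frame(bandwidth_bytes_per_s, bytes_per_particle, fps, passes=2):
    """Same bound, split across the frames rendered in one second."""
    return particles_per_second(bandwidth_bytes_per_s, bytes_per_particle, passes) / fps


BANDWIDTH = 1000e9        # assumed: ~1000 GB/s global memory bandwidth
BYTES_PER_PARTICLE = 24   # assumed: 3 position + 3 velocity floats, 4 bytes each

per_second = particles_per_second(BANDWIDTH, BYTES_PER_PARTICLE)
per_frame_60 = particles_per_frame(BANDWIDTH, BYTES_PER_PARTICLE, fps=60)
print(f"{per_second:.3g} particles/s, {per_frame_60:.3g} particles/frame at 60 fps")
```

Under these assumptions the bound lands at about 2e10 updates per second, which is in line with the "multiple billions per second" guess; real kernels will come in lower.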
the answer is a big "it depends", but I can give you some ballpark intuition.
perhaps it's easiest to think about regular image processing because it uses the same hardware. you can think about each pixel as a particle.
a typical 4k (3840 x 2160 at 16:9) image contains about 8 million pixels. a trivial compute shader that just writes 4 bytes per pixel of some trivial value (e.g. the compute shader thread ids) will take roughly 0.05ms to 0.5ms on modern-ish GPUs. the range is wide because the hardware spread is wide. on current high end GPUs you will be very close to the 0.05ms, or maybe even a bit faster.
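Turning those timings into an effective write bandwidth helps calibrate them. The sketch below only uses the numbers from the comment (3840 x 2160 pixels, 4 bytes per pixel, 0.05 ms and 0.5 ms); the function name is made up, and GB here means 1e9 bytes.

```python
WIDTH, HEIGHT = 3840, 2160
BYTES_PER_PIXEL = 4

pixels = WIDTH * HEIGHT                 # about 8.3 million
total_bytes = pixels * BYTES_PER_PIXEL  # about 33 MB written per pass


def write_bandwidth_gb_s(num_bytes, milliseconds):
    """Effective bandwidth in GB/s for writing num_bytes in the given time."""
    seconds = milliseconds / 1000.0
    return num_bytes / seconds / 1e9


fast = write_bandwidth_gb_s(total_bytes, 0.05)  # high-end end of the range
slow = write_bandwidth_gb_s(total_bytes, 0.5)   # low end of the range
print(f"{pixels} pixels, {total_bytes / 1e6:.1f} MB per pass")
print(f"0.05 ms -> {fast:.0f} GB/s, 0.5 ms -> {slow:.0f} GB/s")
```

So the fast end of the range corresponds to several hundred GB/s of sustained writes, and the slow end to several tens of GB/s.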
but real world programs like video games do a whole lot more than just write a trivial value. they read and write a lot more data (usually there are many passes - so it's not done just once - in the end maybe a few hundred bytes per pixel), and usually run many thousands of instructions per pixel. I work on a video game everyone's probably heard about, and one of the main material shaders is too large to fit into my work GPU's instruction cache (32 KB), to give you an idea of how many instructions are in there (not all executed of course - some branching involved).
and you can still easily do this all at 100+ frames per second on high end GPUs.
so you can in principle simulate a lot of particles. of course, the algorithm scaling matters. most of rendering is somewhere in O(n). anything involving physics will probably involve some kind of interaction between objects which immediately implies O(n log n) at the very least but usually more.
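As a small illustration of why the interaction step dominates the scaling, here is a sketch comparing an all-pairs collision check with a uniform grid, one common way to avoid testing every pair. It is 2D and pure Python for brevity, the function names are made up, and it says nothing about how any particular engine does it.

```python
import itertools
import random
from collections import defaultdict


def brute_force_pairs(points, radius):
    """All-pairs check: n * (n - 1) / 2 distance tests."""
    r2 = radius * radius
    found = set()
    for i, j in itertools.combinations(range(len(points)), 2):
        dx = points[i][0] - points[j][0]
        dy = points[i][1] - points[j][1]
        if dx * dx + dy * dy <= r2:
            found.add((i, j))
    return found


def grid_pairs(points, radius):
    """Bin points into cells of side `radius`, then test each point only
    against points in its own cell and the 8 surrounding cells."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // radius), int(y // radius))].append(idx)

    r2 = radius * radius
    found = set()
    for (cx, cy), members in cells.items():
        for ox in (-1, 0, 1):
            for oy in (-1, 0, 1):
                # .get() does not create missing cells in the defaultdict.
                for j in cells.get((cx + ox, cy + oy), ()):
                    for i in members:
                        if i < j:  # record each pair once, smaller index first
                            dx = points[i][0] - points[j][0]
                            dy = points[i][1] - points[j][1]
                            if dx * dx + dy * dy <= r2:
                                found.add((i, j))
    return found


rng = random.Random(0)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(300)]
pairs = grid_pairs(points, 0.5)
```

Both functions return the same set of pairs; the grid version just skips the distance tests between points that are too far apart to touch, which is where the savings come from once n gets large.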
TLDR: 1B particles ~ 3s per iteration
For examples like particle simulations, on a single node with a 4090 GPU, with everything running on the GPU and no memory transfer to the CPU:
- The main bottleneck is memory capacity: with 24GB available, and each particle storing 3 position coordinates + 3 velocity coordinates at 4 bytes per number (float32), i.e. 24 bytes per particle, the max is about 1B particles.
- Then GPU memory bandwidth: if everything is on the GPU you get between 1000GB/s for global memory access and 10000GB/s when shared memory caches are hit. The number of memory accesses is roughly proportional to the number of effective collisions between your particles, which is proportional to the number of particles times the neighbors per particle, so around 12-30 accesses per particle (see the optimal sphere packing number of neighbors in 3D, and multiply by your overlap factor). All in all, for 1B particles, you can collide them all and move them in 1 to 10s.
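The two bullets above can be written out as arithmetic. This sketch assumes GB means 1e9 bytes, that each neighbor interaction costs one full 24-byte particle read, and that the GPU sustains its peak global bandwidth; the function name is made up. It gives a bandwidth-only floor, so it comes out below the 1 to 10 s quoted above. Reading the gap as real kernels not reaching peak bandwidth is my assumption, not something the comment states.

```python
GPU_MEMORY_BYTES = 24e9          # 24 GB card, taking 1 GB = 1e9 bytes
FLOATS_PER_PARTICLE = 6          # 3 position + 3 velocity coordinates
BYTES_PER_FLOAT = 4              # float32

bytes_per_particle = FLOATS_PER_PARTICLE * BYTES_PER_FLOAT  # 24 bytes
max_particles = GPU_MEMORY_BYTES / bytes_per_particle       # about 1e9


def seconds_per_iteration(n_particles, particle_bytes, neighbors, bandwidth_bytes_per_s):
    """Bandwidth-only floor: every particle reads `neighbors` other particles."""
    traffic_bytes = n_particles * particle_bytes * neighbors
    return traffic_bytes / bandwidth_bytes_per_s


GLOBAL_BANDWIDTH = 1000e9        # assumed: 1000 GB/s global memory

best_case = seconds_per_iteration(max_particles, bytes_per_particle, 12, GLOBAL_BANDWIDTH)
worst_case = seconds_per_iteration(max_particles, bytes_per_particle, 30, GLOBAL_BANDWIDTH)
print(f"max particles: {max_particles:.3g}")
print(f"bandwidth floor: {best_case:.2f} s to {worst_case:.2f} s per iteration")
```

Note that filling all 24 GB with particle state leaves no room for neighbor lists or grids, so a practical particle count would be somewhat below this maximum.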
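If you have to transfer things to the CPU, you become limited by PCI Express bandwidth. I am assuming about 16GB/s here (that is the PCIe 3.0 x16 figure; PCIe 4.0 x16 is nominally about 32GB/s per direction). At 16GB/s you can at most move 1B particles to and from the GPU 0.7 times per second.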
Then if you want to store the particles on disk instead of in RAM because your system is bigger, you can either use an M.2 SSD (but you will burn through them quickly), which has a theoretical bandwidth of 20GB/s so is not a bottleneck, or use network storage over 100Gb/s (= 12.5GB/s) Ethernet, via two interfaces to your parameter server, which can be as big as you can afford.
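For comparison, here is how long it takes to push the full 24 GB particle state over each of the links mentioned above. The bandwidth figures are the ones quoted in the comments, taken at face value; the dictionary keys and function name are only for this sketch.

```python
STATE_BYTES = 24e9  # 1B particles x 24 bytes, as estimated above

# Bandwidths in GB/s, copied from the comments rather than measured.
LINKS_GB_PER_S = {
    "PCIe (16 GB/s as quoted)": 16.0,
    "M.2 SSD (20 GB/s theoretical, as quoted)": 20.0,
    "100 Gb/s Ethernet": 100 / 8,  # bits to bytes gives 12.5 GB/s
}


def transfer_seconds(num_bytes, gb_per_s):
    """One-way time to move num_bytes over a link of the given bandwidth."""
    return num_bytes / (gb_per_s * 1e9)


for name, gb_per_s in LINKS_GB_PER_S.items():
    t = transfer_seconds(STATE_BYTES, gb_per_s)
    print(f"{name}: {t:.2f} s per transfer, {1 / t:.2f} transfers per second")
```

The PCIe line reproduces the 0.7 transfers per second above as a one-way figure. A full round trip only runs at that rate if the upload and download overlap, which the full-duplex link allows in principle.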
So to summarize so far: 1B particles take 1 to 10s per iteration per GPU. If you want to do smarter integration schemes like RK4, you divide the speed by 6. If you need 64-bit precision you divide by 2. If you only need 16-bit precision you can multiply by 2.
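To show where the extra cost of a higher-order integrator comes from, here is a minimal sketch of explicit Euler next to classical RK4, with a wrapper that counts right-hand-side evaluations. RK4 needs four evaluations per step where Euler needs one. The factor of 6 above is the commenter's estimate of the total slowdown; this sketch only counts evaluations and does not try to reproduce that number. `CountingRHS` and `integrate` are names invented for the example.

```python
class CountingRHS:
    """Wraps f(t, y) and counts how many times it is evaluated."""

    def __init__(self, f):
        self.f = f
        self.calls = 0

    def __call__(self, t, y):
        self.calls += 1
        return self.f(t, y)


def euler_step(f, t, y, h):
    """Explicit Euler: one evaluation of f per step, first-order accurate."""
    return y + h * f(t, y)


def rk4_step(f, t, y, h):
    """Classical Runge-Kutta: four evaluations of f per step, fourth order."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(step, f, y0, t_end, n_steps):
    """Advance y from t = 0 to t_end with a fixed step size."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y
```

With y' = y and ten steps from 0 to 1, RK4 lands much closer to e than Euler does, at four times the number of evaluations. In a particle sim each evaluation is a full force or collision pass, so the per-step cost grows accordingly.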
The number of particles you need: volume of the box / h^3, with h the diameter of a particle = the finest detail you want to be able to resolve.
If you use an adaptive scheme, most of your particles are close to the surface of objects, so O(surface of objects / h^2) with h = average resolution of the surface of the mesh. But an adaptive scheme is 10 times slower.
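A quick worked example of the two counting rules above. The box size and resolution are made-up values, and the adaptive count takes the constant hidden in the O() to be 1, which is an assumption for illustration only.

```python
def uniform_particle_count(volume, h):
    """Particles needed to fill a volume at resolution h: V / h^3."""
    return volume / h**3


def adaptive_particle_count(surface_area, h):
    """Order-of-magnitude count when particles concentrate near surfaces:
    S / h^2, with the constant factor assumed to be 1."""
    return surface_area / h**2


# Hypothetical case: a 1 m cube resolved down to 1 mm.
SIDE = 1.0    # meters
H = 1e-3      # meters

uniform = uniform_particle_count(SIDE**3, H)
adaptive = adaptive_particle_count(6 * SIDE**2, H)  # a cube has 6 faces
print(f"uniform: {uniform:.3g} particles, adaptive: {adaptive:.3g} particles")
```

In this example the uniform scheme needs about 1e9 particles, right at the 24 GB budget discussed above, while the surface-following count is about 6e6, which leaves a lot of room even after paying the 10x slowdown.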
The precision of the approximation can be bounded with Taylor's formula. SPH is typically order 2, but has issues with boundaries, so to represent a sharp boundary h must be small.
If you want higher order and sharp boundaries, you can use the Finite Element Method instead. But you'll need to tessellate the space with things like Delaunay/Voronoi, and update the tessellation as the particles move.
Might be worth starting with a baseline where there’s no collision, only advection, and assume higher than 1fps just because this gives higher particles per second but still fits in 24GB? I wouldn’t be too surprised if you can advect 100M particles at interactive rates.
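For reference, an advection-only baseline is about as simple as a particle update gets. The sketch below is a CPU stand-in in plain Python: each particle looks up a velocity at its own position and takes one explicit Euler step, with no interaction between particles. The velocity field and all names are hypothetical, and the footprint helper reuses the 24 bytes per particle from earlier in the thread.

```python
def advect(positions, velocity_at, dt):
    """One explicit Euler advection step: x <- x + dt * v(x).
    Each particle is independent, so this maps to one GPU thread per particle."""
    return [
        tuple(p + dt * v for p, v in zip(pos, velocity_at(pos)))
        for pos in positions
    ]


def rotate_about_z(pos):
    """Hypothetical velocity field: rigid rotation around the z axis."""
    x, y, z = pos
    return (-y, x, 0.0)


def state_gigabytes(n_particles, floats_per_particle=6, bytes_per_float=4):
    """Memory footprint of the particle state, with 1 GB = 1e9 bytes."""
    return n_particles * floats_per_particle * bytes_per_float / 1e9


particles = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]
moved = advect(particles, rotate_about_z, dt=0.1)
footprint = state_gigabytes(100_000_000)
```

With these assumptions, 100M particles take about 2.4 GB, so the state fits comfortably in 24 GB and each step is one read and one write per particle.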
Well, to get that intuition, I guess you have to start experimenting. WebGPU makes it quite easy to get started with the concept. But in general it obviously depends on what kind of GPU you have.
Ah yes, I dreamed about doing something like this, just with even more detail, ages ago, but concluded I won't get even close to what I want without having a big team at my disposal and a supercomputer and/or a couple of universities collaborating across disciplines. So far I have been busy with other things, and reading about his experience unsurprisingly kind of confirms the challenge there is - mainly performance. But GPUs are on the rise and I am optimistic for the future. If the AI bubble bursts, I suppose lots of cheap GPU power will be available for experiments like these and more elaborate ones. And if not, compute power per dollar will likely rise anyway.