If you're going to the effort of writing a proc macro, you may as well output a string from the macro instead of code.
If you're going for idiomatic Rust, then you might instead output a type with a Display impl rather than generating code that writes to stdout.
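A minimal sketch of that idea, assuming a plain `FizzBuzz(n)` newtype of my own (not anything from the article): the `Display` impl writes the sequence to whatever formatter it's handed, so the caller chooses between printing directly and buffering via `to_string()`.

```rust
use std::fmt;

// Hypothetical newtype: formatting it produces the FizzBuzz sequence up to n.
struct FizzBuzz(u64);

impl fmt::Display for FizzBuzz {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for i in 1..=self.0 {
            match (i % 3, i % 5) {
                (0, 0) => writeln!(f, "FizzBuzz")?,
                (0, _) => writeln!(f, "Fizz")?,
                (_, 0) => writeln!(f, "Buzz")?,
                _ => writeln!(f, "{i}")?,
            }
        }
        Ok(())
    }
}

fn main() {
    // The caller decides where the output goes: stdout here, or a String via
    // `FizzBuzz(100).to_string()` if buffering is wanted.
    print!("{}", FizzBuzz(100));
}
```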
If OP is looking for ideas, there are two intermediate steps between the extremes of "write every line to stdout" and "build up a buffer of the whole output and then write it to stdout".
1. `stdout().lock()` and `writeln!()` to that. By default, `print*!()` writes to `stdout()`, which takes a process-wide lock on each call. (Funnily enough, they use `.lock()` in the "build up a buffer of the whole output" section, just to do one `.write_all()` call, which is the one time they don't need `.lock()`, because `Stdout`'s impl of `write_all()` only takes the lock once anyway.)
2. Wrap the locked stdout in a `BufWriter` and `writeln!()` to that. It won't flush on every line, but it also won't buffer the entire output, so it's a middle ground between speed and memory usage (see the sketch below).
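A rough sketch of both intermediate steps together (my own example, not code from the article): take the lock once, then wrap it in a `BufWriter` so lines go out in chunks.

```rust
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    // Step 1: take the stdout lock once instead of once per print*! call.
    // Step 2: wrap the locked handle in a BufWriter so writes are batched
    // without buffering the entire output in memory.
    let mut out = BufWriter::new(io::stdout().lock());

    for i in 1u64..=1_000_000 {
        match (i % 3, i % 5) {
            (0, 0) => writeln!(out, "FizzBuzz")?,
            (0, _) => writeln!(out, "Fizz")?,
            (_, 0) => writeln!(out, "Buzz")?,
            _ => writeln!(out, "{i}")?,
        }
    }
    out.flush()
}
```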
---
For the final proc macro approach, there is the option to unroll the loop in the generated code, and the option to generate a `&'static str` literal of the output.
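For the second option, a hedged sketch of what such a macro body could look like (hypothetical crate and macro names, using `syn`/`quote`; the article's actual macro may differ): parse the upper bound, build the whole output at compile time, and expand to a single string literal.

```rust
// In a separate proc-macro crate, e.g. a hypothetical `fizzbuzz_macros`.
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, LitInt};

/// fizzbuzz_str!(1_000_000) expands to a `&'static str` holding the full output.
#[proc_macro]
pub fn fizzbuzz_str(input: TokenStream) -> TokenStream {
    let n: u64 = parse_macro_input!(input as LitInt)
        .base10_parse()
        .expect("expected an integer literal");

    // Build the entire output at compile time...
    let mut s = String::new();
    for i in 1..=n {
        match (i % 3, i % 5) {
            (0, 0) => s.push_str("FizzBuzz\n"),
            (0, _) => s.push_str("Fizz\n"),
            (_, 0) => s.push_str("Buzz\n"),
            _ => s.push_str(&format!("{i}\n")),
        }
    }

    // ...and emit it as one string literal.
    quote!(#s).into()
}
```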
In my opinion, a more accurate measure when you go down to the microsecond level is the TSC read directly from the CPU. I've built a benchmark tool for that: https://github.com/sh4ka/hft-benchmarks
Also, I think CPU pinning could help in this context, but perhaps I need to check the code on my machine first.
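For reference, a minimal sketch of the raw-TSC approach on x86_64 (my own example using `std::arch` intrinsics, not code from the linked tool): serialize with `cpuid`, read the counter with `rdtsc` on either side of the code under test.

```rust
// x86_64 only; converting cycle counts to time also needs the TSC frequency.
use std::arch::x86_64::{__cpuid, _rdtsc};

fn main() {
    let start = unsafe {
        __cpuid(0); // serialize so earlier instructions can't drift past the read
        _rdtsc()
    };

    let work: u64 = (1..=1_000u64).sum(); // stand-in for the code being measured

    let end = unsafe {
        __cpuid(0);
        _rdtsc()
    };

    println!("sum = {work}, elapsed ≈ {} TSC cycles", end - start);
}
```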
How does this compare with divan?
Reminds me of the famous thread on Stack Overflow. I'll link the Rust one directly, but one C++ answer claims 283 GB/s, and others are in the ballpark of 50 GB/s.
The Rust one claims around 3 GB/s:
https://codegolf.stackexchange.com/a/217455
You can take this much further! I think throughput is a great way to measure it.
Things like pre-allocation, no branching, constants, SIMD, etc. (see the sketch below for one of these).
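As one concrete direction, here's a hedged sketch of the "constants / no divisibility checks" idea (my own example, not from the article): a fixed pattern table indexed by `i % 15` replaces the modulo-3/modulo-5 tests; the remaining `Some`/`None` branch could itself be flattened further.

```rust
use std::io::{self, BufWriter, Write};

// Precomputed pattern for positions 1..=15; None means "print the number".
const PATTERN: [Option<&str>; 15] = [
    None, None, Some("Fizz"), None, Some("Buzz"),
    Some("Fizz"), None, None, Some("Fizz"), Some("Buzz"),
    None, Some("Fizz"), None, None, Some("FizzBuzz"),
];

fn main() -> io::Result<()> {
    let mut out = BufWriter::new(io::stdout().lock());
    for i in 1u64..=1_000_000 {
        match PATTERN[((i - 1) % 15) as usize] {
            Some(word) => writeln!(out, "{word}")?,
            None => writeln!(out, "{i}")?,
        }
    }
    out.flush()
}
```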
Maybe I’m missing something but can’t you unroll it very easily by 15 prints at a time? That would skip the modulo checks entirely, and you could actually cache everything but the last two or three digits.
> Maybe I’m missing something but can’t you unroll it very easily by 15...
Sure, 3 x 5 = 15. But, FTA:
> But then, by coincidence, I watched an old Prime video and decided to put the question to him: how would you extend this to 7 = "Baz"?
> He expanded the if-else chain: I asked him to find a way to do it without explosively increasing the number of necessary checks with each new term added. After some hints and more discussion...
Which is why I respectfully submit that almost all examples of FizzBuzz, including the article's first, are "wrong", while the refactor is "right".
As for the optimizations, they don't focus on only 3 and 5; they include 7 throughout.
> At this point, I'm out of ideas. The most impactful move would probably be to switch to a faster terminal... but I'm already running Ghostty! I thought it was a pretty performant terminal to begin with!
But what is the point? Why do you want to optimize the display? If you want to be able to fizz-buzz millions of numbers, then realistically you only want to compute them just before they are displayed.
Because the display is the bottleneck.