On semi-related note, it's worth noting that if you're trying to make a Python script run faster and don't have the know-how to re-write your program in C or how to write SIMD (if applicable), you can always try to run the script through pypy, merely replacing python3 with pypy3 in bench.sh, with no other changes, brings the runtime of the first program down from 104s to 9s on my machine:
Benchmark 1: python3 0_mvp.py bench.txt
Time (mean ± σ): 104.739 s ± 3.982 s [User: 104.213 s, System: 0.158 s]
Range (min … max): 100.303 s … 108.005 s 3 runs
Benchmark 2: python3 1_c_regex.py bench.txt
Time (mean ± σ): 14.777 s ± 0.017 s [User: 14.563 s, System: 0.158 s]
Range (min … max): 14.759 s … 14.791 s 3 runs
Benchmark 1: pypy3 0_mvp.py bench.txt
Time (mean ± σ): 9.381 s ± 0.204 s [User: 9.110 s, System: 0.234 s]
Range (min … max): 9.245 s … 9.616 s 3 runs
Benchmark 2: pypy3 1_c_regex.py bench.txt
Time (mean ± σ): 4.296 s ± 0.031 s [User: 4.038 s, System: 0.236 s]
Range (min … max): 4.262 s … 4.324 s 3 runs
You can avoid hard-coding the whitespace symbols and have a generic byte-set search kernel via `vpshufb` AVX512BW-capable CPUs [1] or via `tbl` instructions on NEON-capable CPUs [2].
You don't need AVX512BW for shuffle, SSSE3 will do. (Of course, if you want wider registers, you'll need the newer versions such as AVX2 or AVX512, but they don't shuffle cross-lane.)
This was a nice opportunity to learn about Grand Central Dispatch through AI for me as well. I knew about the page alignment and async read techniques but not on OSX.
It's a wonderful problem for optimizing code. Michael Abrash hosted a performance contest for word counting back in... 1991? (If my memory serves.) The article and code can be found here:
There Ain’t No Such Thing as the Fastest Code:
Lessons Learned in the Pursuit of the Ultimate Word Counter
As it happens, splitting input text into words fast is one of the things I most want to do this week! But maybe that's because it's a distraction from benchmarking hash tables.
I don't know ARM, but an alternate approach, if it's available, is to store the query constants as bitmasks in SIMD registers; and use the input bytes as indices into those constants, using a shuffle instruction. Two levels, to pull out a bit from a 256-bit mask: part of an input byte is used to index a byte (SIMD shuffle), and another part indices a bit within the byte (bit shifts).
Idea being, this is constant in the size of the query set.
Both the alternative version by Geoff Langdale, and the special case for small sets, are substantially similar to the algorithms used in Hyperscan (truffle and shufti).
https://github.com/intel/hyperscan
Having something hard coded for spaces can be much faster, especially since 5 of the 6 characters are a range: a wrap-around subtraction and an unsigned less-than does the first 5; an equality compare does the other.
On semi-related note, it's worth noting that if you're trying to make a Python script run faster and don't have the know-how to re-write your program in C or how to write SIMD (if applicable), you can always try to run the script through pypy, merely replacing python3 with pypy3 in bench.sh, with no other changes, brings the runtime of the first program down from 104s to 9s on my machine:
You can avoid hard-coding the whitespace symbols and have a generic byte-set search kernel via `vpshufb` AVX512BW-capable CPUs [1] or via `tbl` instructions on NEON-capable CPUs [2].
[1]: https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...
[2]: https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...
You don't need AVX512BW for shuffle, SSSE3 will do. (Of course, if you want wider registers, you'll need the newer versions such as AVX2 or AVX512, but they don't shuffle cross-lane.)
> Perform a bitwise AND with 0xFB and check if the result equals 0x09. Both 0x0D & 0xFB = 0x09 and 0x09 & 0xFB = 0x09.
This explanation was a bit unsatisfying. This works because 0x09 and 0x0D differ by a single bit, and 0xFB masks that bit (and only that bit) out.
If they differed by more than one bit, the fact that they & the same would be necessary but not sufficient.
I was able to get a little bit faster than the multithreaded version with a single thread using page-aligned reads and Grand Central Dispatch for asynchronous read operations: https://github.com/healeycodes/counting-words-at-simd-speed/...
This was a nice opportunity to learn about Grand Central Dispatch through AI for me as well. I knew about the page alignment and async read techniques but not on OSX.
Edit: actually you can get even faster with mmap https://github.com/healeycodes/counting-words-at-simd-speed/...
It's a wonderful problem for optimizing code. Michael Abrash hosted a performance contest for word counting back in... 1991? (If my memory serves.) The article and code can be found here:
There Ain’t No Such Thing as the Fastest Code: Lessons Learned in the Pursuit of the Ultimate Word Counter
Article: https://www.phatcode.net/res/224/files/html/ch16/16-01.html
Code: https://www.phatcode.net/res/224/files/html/ch16/16-05.html
As it happens, splitting input text into words fast is one of the things I most want to do this week! But maybe that's because it's a distraction from benchmarking hash tables.
Author here, one of the fastest improvements I've seen is @cloud11665's idea here: https://x.com/cloud11665/status/1955958965046595699
I don't know ARM, but an alternate approach, if it's available, is to store the query constants as bitmasks in SIMD registers; and use the input bytes as indices into those constants, using a shuffle instruction. Two levels, to pull out a bit from a 256-bit mask: part of an input byte is used to index a byte (SIMD shuffle), and another part indices a bit within the byte (bit shifts).
Idea being, this is constant in the size of the query set.
But that's slower for small query sizes.
This describes a few algorithms: http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html
Both the alternative version by Geoff Langdale, and the special case for small sets, are substantially similar to the algorithms used in Hyperscan (truffle and shufti). https://github.com/intel/hyperscan
Having something hard coded for spaces can be much faster, especially since 5 of the 6 characters are a range: a wrap-around subtraction and an unsigned less-than does the first 5; an equality compare does the other.
Would be interesting to see how far this is with exaloop's codon compiler.
I think you can just xor the whitespace mask with the shifted one.
Also, when counting 0xFF bytes from a boolean etc., sub the mask; 0xFF == -1.