Decoding UTF-8. Part III: Determining Sequence Length – A Lookup Table

(nemanjatrifunovic.substack.com)

9 points | by rbanffy 5 days ago

2 comments

  • procaryote a day ago

    Right-shifting three bits would reduce the size of the lookup table to 32 slots

    I guess something like

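        const int extra_bits = (sizeof(int) - 1) * 8;
        /* the cast clears the inverted upper bits; | 1u keeps clz defined for 0xFF */
        int x = __builtin_clz((unsigned char)~lead_byte | 1u);
        return (x == extra_bits) + (x > 1 + extra_bits) * (x < 5 + extra_bits) * (x - extra_bits);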
    
    could work, although I've not tested it for all cases or checked if it's fast

    The idea there is to invert the bits, use a built-in operation to count leading zeros (i.e. leading ones in the original byte), and then do some math to achieve the same semantics as the lookup table.
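
    A minimal sketch of that idea, checked against a plain bit-pattern classifier over all 256 byte values. The function names are hypothetical, `__builtin_clz` is the GCC/Clang builtin, and the byte cast plus `| 1u` are assumptions added so the complemented value stays within one byte and is never zero (the builtin is undefined for zero). It only looks at the leading-bit pattern, so it does not reject 0xC0, 0xC1, or 0xF5 and up.

```c
#include <assert.h>
#include <stdint.h>

/* Reference classifier (hypothetical name): sequence length from the
   leading-bit pattern alone, 0 for continuation or malformed lead bytes. */
static int seq_len_reference(uint8_t lead_byte) {
    if (lead_byte < 0x80) return 1;            /* 0xxxxxxx */
    if ((lead_byte & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead_byte & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead_byte & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                                  /* 10xxxxxx or 11111xxx */
}

/* Branch-free variant (hypothetical name): count the leading ones of the
   byte by counting the leading zeros of its complement. */
static int seq_len_clz(uint8_t lead_byte) {
    /* Bits of unsigned int that sit above the low byte. */
    const int extra_bits = (int)(sizeof(unsigned int) - 1) * 8;
    /* The uint8_t cast drops the upper bits that ~ set after integer
       promotion; | 1u avoids passing zero when lead_byte is 0xFF. */
    int x = __builtin_clz((unsigned int)(uint8_t)~lead_byte | 1u);
    int ones = x - extra_bits;  /* leading ones in lead_byte, capped at 7 */
    /* 0 ones -> 1 (ASCII); 2..4 ones -> that many bytes; otherwise 0. */
    return (ones == 0) + (ones > 1) * (ones < 5) * ones;
}

/* Number of byte values on which the two functions disagree. */
static int count_mismatches(void) {
    int mismatches = 0;
    for (int b = 0; b < 256; ++b)
        mismatches += seq_len_clz((uint8_t)b) != seq_len_reference((uint8_t)b);
    return mismatches;
}
```

    Calling `count_mismatches()` is a cheap way to confirm the arithmetic for every possible lead byte; whether it beats a table lookup would still need measuring.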

    • zahlman a day ago

      > Right-shifting three bits

      This is not compatible with the special cases that need to be checked (e.g. the 0xC0 and 0xC1 start bytes must be rejected, but after the shift they share a table slot with 0xC2).
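
      To make the collision concrete, here is a rough sketch of a 32-slot table indexed by the top five bits. The table and function names are hypothetical, and the explicit range check is only one possible workaround, not something from the article. It relies on valid UTF-8 lead bytes for multi-byte sequences being 0xC2 through 0xF4.

```c
#include <stdint.h>

/* Hypothetical 32-slot table indexed by lead_byte >> 3. */
static const uint8_t len_by_top5[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x00..0x7F ASCII        */
    0, 0, 0, 0, 0, 0, 0, 0,                         /* 0x80..0xBF continuation */
    2, 2, 2, 2,                                     /* 0xC0..0xDF two bytes    */
    3, 3,                                           /* 0xE0..0xEF three bytes  */
    4,                                              /* 0xF0..0xF7 four bytes   */
    0                                               /* 0xF8..0xFF invalid      */
};

/* Table lookup only: sees the top five bits, nothing else. */
static int seq_len_shifted(uint8_t lead_byte) {
    return len_by_top5[lead_byte >> 3];
}

/* Nonzero when two bytes land in the same slot after the shift. */
static int same_slot(uint8_t a, uint8_t b) {
    return (a >> 3) == (b >> 3);
}

/* One possible workaround: reject the bytes the table cannot tell apart
   (0xC0, 0xC1, and 0xF5 and up) before the lookup. */
static int seq_len_checked(uint8_t lead_byte) {
    if (lead_byte == 0xC0 || lead_byte == 0xC1 || lead_byte > 0xF4)
        return 0;
    return seq_len_shifted(lead_byte);
}
```

      Since 0xC0 and 0xC2 share a slot, `seq_len_shifted` gives them the same answer, and the same goes for 0xF4 and 0xF5. The smaller table therefore costs at least one extra comparison somewhere.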