Writing a competitive BZip2 encoder in Ada from scratch in a few days – part 2

(gautiersblog.blogspot.com)

104 points | by ajdude 4 days ago ago

7 comments

  • nayuki 13 hours ago ago

    > Another surprising feature of the BWT is that for reversing the permutation, no extra information is needed!

    I think this is not true. The way I learned the BWT, after encoding, you need to store the index of the first character (which is a tiny bit of extra information). https://web.archive.org/web/20170325024404/http://marknelson...

    • amy_petrik 5 hours ago ago

      well here's a big surprise of BWT

      BWT is a transformation transforming a string into "suffix tree space"

      You can do searches - very fast searches - in BZ2/BWT compression space

      Discussed in the '90s in the legendary dr. dobbs

      But more significantly, BWT is the basis for most DNA and RNA sequence aligners nowadays - which perform string search/matching in suffix tree (transformed) space. It's like doing frequency stuff in fourier domain, it's like adding is multiplication in logspace - but in weird data string search space. The BWT.

    • horizion2025 10 hours ago ago

      I also think when looking at how BWT works, it is initially surprising it is invertible even with such little information as the first character. It appears a bit like sorting the letters and that certainly would require a lot more information to back into place. It is a bit magic even when you know the inversion algorithm.

    • vintermann 13 hours ago ago

      There is a fully bijective version of the BWT which doesn't require the index. It's not what BZip2 uses though.

      • Someone 12 hours ago ago

        That method adds an extra character to your alphabet, and uses that as a marker in the transformed string to h effectively) store the index of the first character inside the transformed text.

        If, as is often the case, your alphabet contains exactly as many letters as a byte/word can store, that makes for awkward encoding. I also think that, typically, that method will require more memory.

        • vintermann 11 hours ago ago

          No, I'm not talking about that. I'm talking about David Scott's bijective BWT variant. It sorts symbols by a slightly different type of context, and so can do away with a sentinel symbol or a stored index entirely. It was described in the wikipedia page last I checked.

          Scott was a bit of a weirdo, but his algorithm actually got recognition in academia, there have been many papers written about it. One character saved doesn't make a lot of difference for compression, but the transform being a permutation does give it some nice properties.

    • etrez 2 hours ago ago

      Thx, corrected.