My ZIP isn't your ZIP: Identifying and exploiting semantic gaps between parsers

(usenix.org)

65 points | by layer8 7 days ago

35 comments

saurik 3 days ago
I'm cited on the first page of this paper (reference 20) for my work on the Android Master Key vulnerability (which I didn't find, to be clear, but I did most of the exploitation people saw), and, while this paper looks AWESOME (and I'm very excited to read it in detail), if you are interested in this concept but feel you need something a bit more concrete--maybe with diagrams and some hand-holding--to understand what is going on, I will recommend my series of articles on Master Key as an introduction.
https://www.saurik.com/masterkey1.html
https://www.saurik.com/masterkey2.html
https://www.saurik.com/masterkey3.html
schoen 3 days ago
This is great. It feels like a central example of the phenomenon of parser differentials (and nice use of tools to find them more efficiently).
Also, as the lead author's name is spelled the same as an English pronoun, we can anticipate natural language parsing ambiguities from writing about this research in English prose! For example, "You discovered that there are many opportunities for parser differentials due to the underspecified nature of the ZIP format" or "You described a practical method of bypassing plagiarism detectors and several other kinds of file content scanners".
Actually, I'm tempted to propose that for the April Fool's Did You Know? on Wikipedia next year. "Did you know ... that You won a Usenix Security award for finding ways to construct ambiguous texts?"
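For readers who want a concrete parser differential, here is a minimal Python sketch of one well-known ambiguity: two entries sharing a name. It is an illustration only, not a method taken from the paper. The helper names (`build_ambiguous_zip`, `read_first_match`, `read_by_name`) and the entry name `payload.txt` are hypothetical, and the two "readers" are both modeled with CPython's `zipfile` to stand in for two different implementations.

```python
import io
import warnings
import zipfile


def build_ambiguous_zip() -> bytes:
    """Return a ZIP archive holding two entries with the same name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names but still writes both entries.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("payload.txt", b"benign")
            zf.writestr("payload.txt", b"malicious")
    return buf.getvalue()


def read_first_match(data: bytes, name: str) -> bytes:
    """Model a reader that walks the entry list and stops at the first match."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.filename == name:
                return zf.read(info)
    raise KeyError(name)


def read_by_name(data: bytes, name: str) -> bytes:
    """Model a reader that resolves the name through a name-to-entry map.

    In CPython's zipfile the map keeps the last entry seen for a name.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name)


if __name__ == "__main__":
    archive = build_ambiguous_zip()
    print(read_first_match(archive, "payload.txt"))
    print(read_by_name(archive, "payload.txt"))
```

If a scanner behaves like the first reader and the consumer behaves like the second, they disagree about what the same archive contains.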
tptacek 3 days ago
This is a really good paper that reaches a bunch of fun conclusions, but to my eyes the practical findings are kind of marginal --- you can defeat an AV scanner, but you could already defeat AV scanners; you can defeat plagiarism-detectors, but you could already defeat plagiarism-detectors; you can package a malicious Java class in a benign-looking JAR, but that attack presumes you're convincing a target to load a JAR file you control.
The one legit-practical attack I see is the one where they trick the VS Code Extension marketplace into serving extensions with trusted publishers, but even there I'm struck by the fact that the security model for verifying extensions would depend on ZIP metadata.
I do not at all mean to talk this work down; this is my favorite species of vulnerability research, and I can see why it did well at Usenix Security.
- FreakLegion 3 days ago
  It's a decent systematic look at something people have been doing ad hoc for a long time. In 2010 or so I realized:
  1. Authenticode signatures have unauthenticated sections.
  2. ZIP files don't require headers.
  So you can shove a ZIP file (i.e. JAR, DOCM, APK, etc.) into a signed Windows executable without breaking its signature, and then depending on the extension it will do any number of things when clicked.
  (The extent to which this works has changed a lot in the intervening years, but prior to a patch in 2013 it was especially bad, and the patches never made their way into the spec, so custom Authenticode validators like Wine's or, say, the one in Palo Alto Networks gear, were still vulnerable the last time I checked.)
  Anyway, at the same time:
  1. Cybersecurity products lean on Authenticode to keep false positives down for specific publishers.
  2. Those same products cache everything by hash without regard for file type.
  Put all of this together and you could, as of 2020 at least, not only execute whatever you wanted, you could also have it misreported by CrowdStrike or whoever as a signed Windows component.
  Fun stuff, but I agree that it's kind of marginal.
- layer8 2 days ago
  The attack vector for publishing extensions existed for Firefox (and was fixed): https://bugzilla.mozilla.org/show_bug.cgi?id=1534483
pabs3 3 days ago
A linter for zip files that can probably detect some of these:
https://github.com/ronomon/pure
- cxr 2 days ago
  1. Describing this as a "linter for zip files" is kind of weird—this library is a full-on ZIP implementation that is meant to be used for the kinds of things people use any sort of ZIP library for.
  2. It's one of the libraries that the authors of the paper cited and subjected to testing. It's column/row 31—the one that is the source of the prominent vertical/horizontal bands in Table 4 (on p. 450 aka p. 21)
  - pabs3 2 days ago
    1. I think you are thinking of https://github.com/ronomon/zip? The description for pure says it is a static analysis tool for zip files. That makes it a linter in my book.
    2. I see, thanks.
    - cxr a day ago
      Yes, I was wrong.
      (HN obscures the end of the URL; I assumed it was Ronomon's ZIP library. The 2 in my comment also applies to that library.)
est 3 days ago
IIRC similar attacks exist on DEFLATE.
There used to be a .png picture that displayed totally different content in Safari, Firefox, and IE.
captn3m0 3 days ago
Also related to ZIP parsing differentials, recently reported and fixed at PyPI: https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusi...
- tptacek 3 days ago
  It's good to see stuff like this getting found and fixed, but let me ask: given how the Python packaging ecosystem works, what is the practical scenario in which this would be exploitable?
  - cxr 2 days ago
    woodruffw writes in the corresponding HN thread:
    > security scanners are a simple example, but Linux distros, Homebrew, etc. all also process Python package distributions in ways that mostly just assume a ZIP container, without additionally trying to exactly match how Python's `zipfile` behaves
    <https://news.ycombinator.com/item?id=44829881>
    This doesn't necessarily unlock any new capabilities, but in light of the xz exploit (whereby you have a repo over there that ostensibly corresponds to the package published right here, but with the latter actually comprising a different payload of runnable code), it's not inconceivable that an attacker would take advantage of the behavior between different implementations to level up the obfuscation/misdirection and evade detection for longer.
    (FWIW I regarded at the time (and still regard) the hoopla around the PyPI/Astral blog posts a tad overblown, with the purported threat vague at best—especially where the claims about the ambiguity of the ZIP format that are at the crux of the issue are already dubious. On the latter point, it's nice that the authors of the USENIX paper contrast between implementations that use the "standard" method versus otherwise.)
    - tptacek 2 days ago
      I actually talked to 'woodruffw just before writing that comment. :)
pixl97 3 days ago
ZIP is a fun minefield across different OSes, libraries, and ages of system. Zip64 is a fun one: I've seen companies forget to test it and end up losing data once an archive holds more than 65,535 files and has to interact with more modern systems. There are so many things you need to test that going with some other compression format without these pitfalls is your best choice if possible.
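The 65,535 figure comes from the classic end-of-central-directory (EOCD) record, whose entry-count fields are 16 bits wide; the value 0xFFFF signals that the real count lives in the Zip64 records. The Python sketch below reads that field directly. It is a simplified illustration under stated assumptions: the helper names (`make_zip`, `eocd_entry_count`, `needs_zip64_count`) are hypothetical, and locating the EOCD with a plain `rfind` ignores the case where the signature bytes also appear inside an archive comment.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
ZIP64_SENTINEL_16 = 0xFFFF  # 16-bit field value meaning "see the Zip64 record"


def make_zip(n_entries: int) -> bytes:
    """Build an in-memory archive with n_entries empty files (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(n_entries):
            zf.writestr(f"f{i}", b"")
    return buf.getvalue()


def eocd_entry_count(data: bytes) -> int:
    """Return the 16-bit 'total entries' field of the classic EOCD record."""
    pos = data.rfind(EOCD_SIGNATURE)  # simplification: assumes no signature in a comment
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    # Fields after the 4-byte signature: disk number, disk with the central
    # directory, entries on this disk, total entries (all little-endian uint16).
    _disk, _cd_disk, _entries_here, total = struct.unpack_from("<HHHH", data, pos + 4)
    return total


def needs_zip64_count(data: bytes) -> bool:
    """True when the 16-bit count is the sentinel, so a reader must parse Zip64."""
    return eocd_entry_count(data) == ZIP64_SENTINEL_16


if __name__ == "__main__":
    print(needs_zip64_count(make_zip(3)))
    print(needs_zip64_count(make_zip(65536)))
```

A reader that trusts the 16-bit field without checking for the sentinel sees at most 65,535 entries, which is one way files can silently go missing.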
o11c 3 days ago
Key line from the abstract, since zip parser differences in general are old news:
> We summarize our findings as 14 distinct parsing ambiguity types in three categories with detailed analysis, systematizing current knowledge and uncovering 10 types of new parsing ambiguities.
actionfromafar 3 days ago
Tampering with signed binaries sounds pretty serious
- tptacek 3 days ago
  It depends on how they're signed. A signature format that works on individual objects inside of an archive, rather than on a whole signed archive, seems crazy. In this case, it's a JAR file loader; doesn't seem like that big a deal?
  - layer8 2 days ago
    If you want to have the archive contain the signature, you can’t sign the whole archive. Signed documents (docx, odf) work that way.
hinkley 3 days ago
Maybe an argument to use zlib consistently.
- aaviator42 3 days ago
  An argument for a better defined file format specification perhaps, but I don't think it's necessarily a good thing for everyone to use or have to use the same implementation.
  - socalgal2 3 days ago
    As someone who works on specs that are shared across different organizations' implementations, you can write all the specs you want but no conformance tests = no conformance.
    - aaviator42 2 days ago
      A good point! Conformance tests seem like a great idea to me to go along with specs.
  - Muromec 3 days ago
    If everyone has the same parser, whole classes of bugs just stop being exploitable. The classic one is where one parser at the edge validates something, and a parser further down the line sees a different result, one it expects to have been rejected during validation.
    Both parsers could be buggy, but when they have different kinds of bugs, you get a zero-click, undetectable exploit.
    - woodruffw 3 days ago
      I don’t think it’s this simple: you can still produce observable differentials with a single parser by using different options within that parser in different places. The ZIP format itself affords ample opportunities for that.
      - hinkley 2 days ago
        The settings are at encode time. For two readers the results should be unambiguous.
          - woodruffw 2 days ago
            There are plenty of decode-time knobs, even within a single ZIP parser. Here are just a few you could set while using libzip[1].
            [1]: https://libzip.org/documentation/zip_open.html#DESCRIPTION
            - hinkley 2 days ago
              That’s not a lot of settings, and that’s libzip, which is not zlib.
              - woodruffw 2 days ago
                Differentials are oracular; you only need one bit. And I’m not claiming it’s in zlib, since zlib isn’t a ZIP library. TFA here is about ZIP differentials, not differentials in DEFLATE stream parsers.
    - aaviator42 2 days ago
      It significantly increases the attack surfaces of bugs that do exist in the parser if the same implementation is used everywhere.
- woodruffw 3 days ago
  Unless, of course, the differential occurs between versions of zlib. I think the bigger problem here is that ZIP is just not a very well defined format.
- blibble 3 days ago
  zlib (deflate) is just the compression type usually (not always) used in zips
  zip is the container around it
  - pdw 3 days ago
    zlib comes with a basic ZIP implementation (libminizip).
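
To make the container-versus-compression distinction in this subthread concrete, here is a minimal Python sketch. It builds a one-entry archive with `zipfile`, slices the entry's raw compressed bytes out of the container by reading the local file header, and inflates them with `zlib` alone. The helper names (`make_deflated_zip`, `raw_deflate_payload`, `inflate_raw`) and the entry name `hello.txt` are hypothetical, and the sketch assumes an unencrypted archive with no data prepended before the first local header.

```python
import io
import struct
import zipfile
import zlib


def make_deflated_zip(name: str, content: bytes) -> bytes:
    """Build an in-memory archive with one DEFLATE-compressed entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, content)
    return buf.getvalue()


def raw_deflate_payload(data: bytes, name: str) -> bytes:
    """Slice the compressed bytes of one entry out of the ZIP container."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        info = zf.getinfo(name)
    # The local file header has 30 fixed bytes; the file-name length and
    # extra-field length are little-endian uint16 values at offsets 26 and 28.
    name_len, extra_len = struct.unpack_from("<HH", data, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    return data[start:start + info.compress_size]


def inflate_raw(payload: bytes) -> bytes:
    """Inflate a raw DEFLATE stream (negative wbits means no zlib header)."""
    decompressor = zlib.decompressobj(-15)
    return decompressor.decompress(payload) + decompressor.flush()


if __name__ == "__main__":
    original = b"hello zip " * 50
    archive = make_deflated_zip("hello.txt", original)
    print(inflate_raw(raw_deflate_payload(archive, "hello.txt")) == original)
```

Everything the paper's ambiguities hinge on (which header to trust, where an entry starts, which name wins) sits in the container layer handled by `raw_deflate_payload`, not in the DEFLATE stream that `zlib` decodes.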