The Hard Problem of Unzipping

A brief voyage into NPM package quality.

It wouldn’t be ridiculous to suggest that in practice, the two most important features of a programming language/runtime are the diversity and quality of available third-party packages. They’re why, I think, something like Python remains a reasonable choice for many problem domains, despite the language itself being otherwise uninteresting[1] - or why Clojure, on the JVM, is a much more sensible choice than any number of equally thoughtful languages (Dylan, Qi, etc.) which may look as attractive from a distance.

Obviously, quantifying the above statements with any rigour would be difficult[2], and would involve a bunch of tricky assumptions - I’m only offering them as background to a worthless anecdote about my recent attempt to programmatically decompress a ZIP archive in JavaScript, on Node.

[1] A Python programmer might tell you that the average package quality is high precisely because the language itself isn’t particularly interesting. JavaScript is an elegant refutation of that hypothesis.
[2] And has been tried elsewhere.

What Are You Looking For?

The ZIP format has been around for just under 30 years, and is an archive format which optionally supports compression of entries. Basically, a portable abstraction of a filesystem tree.

The specific problem I was trying to solve was that of unzipping an archive which exists on the local filesystem, writing the output to some other location on the local filesystem. In other words, exactly what unzip archive.zip will do out of the box on most Unix-based operating systems. I wasn’t attracted to Node because I’d heard about its legendary ability to decompress files - this feature was part of a larger problem for which Node was a requirement.

The Paradox of Choice

As far as picking an implementation (there are over 100 results for “unzip” on npmjs.org), there were three properties I was looking for:

Doesn’t Buffer Unnecessarily

There’s nothing intrinsic to ZIP files which requires that the entire input or output resides in memory at once, though it seems a popular implementation strategy on Node. A recurring theme is packages which decompress via the stream API, a poor fit for a format whose definitive metadata block (the Central Directory) is stored at the end of the file.
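To make the "metadata at the end" point concrete, here is a minimal sketch of how a reader might locate the End of Central Directory record by scanning backwards through the tail of an archive. The function and constant names are my own, not taken from any of the packages below, and the sketch assumes a plain (non-ZIP64) archive. In practice only the last 22 + 65,535 bytes of the file would need to be read, since that is the largest the record and its trailing comment can be.

```javascript
// Sketch only: names are hypothetical, ZIP64 is not handled.
const EOCD_SIGNATURE = 0x06054b50; // "PK\x05\x06" read as a little-endian uint32
const EOCD_MIN_SIZE = 22;          // fixed part of the record, excluding the comment

// `tail` is a Buffer holding the final bytes of the archive.
function findEndOfCentralDirectory(tail) {
  for (let i = tail.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (tail.readUInt32LE(i) !== EOCD_SIGNATURE) continue;
    const commentLength = tail.readUInt16LE(i + 20);
    // The record must end exactly where the file ends, otherwise the
    // signature bytes were a coincidence inside the comment or file data.
    if (i + EOCD_MIN_SIZE + commentLength !== tail.length) continue;
    return {
      entryCount: tail.readUInt16LE(i + 10),
      centralDirectorySize: tail.readUInt32LE(i + 12),
      centralDirectoryOffset: tail.readUInt32LE(i + 16),
    };
  }
  return null; // not a ZIP file, or truncated
}

// The smallest valid archive is an empty one: a bare 22-byte record.
const emptyArchive = Buffer.alloc(EOCD_MIN_SIZE);
emptyArchive.writeUInt32LE(EOCD_SIGNATURE, 0);
console.log(findEndOfCentralDirectory(emptyArchive));
```

A reader built this way seeks to `centralDirectoryOffset`, reads the entry list, and only then touches individual file data, which is why a forward-only stream is an awkward starting point.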

Restores Unix File Permissions on Unix

File permissions are both operating system specific, and absent from the ZIP specification. That said, on Unix-based systems, the well-established convention is to store modes in the “external file attributes” field. Equivalent functionality would be fine, i.e. a library which delegates file creation to the consumer, and makes the external attributes directly available.
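As a rough illustration of that convention: archives created on Unix keep the file mode in the upper 16 bits of the 32-bit field the ZIP specification calls external file attributes. The helper below is a sketch with a name of my own choosing. It assumes the entry's "version made by" field has already been checked to indicate a Unix host, since the bits mean something else on other platforms.

```javascript
// Sketch only: assumes the entry was created on a Unix host.
const S_IFMT = 0o170000;  // mask for the file-type bits
const S_IFDIR = 0o040000; // directory
const S_IFLNK = 0o120000; // symbolic link

function unixModeFromExternalAttributes(externalFileAttributes) {
  const mode = externalFileAttributes >>> 16; // upper 16 bits hold st_mode
  return {
    permissions: mode & 0o7777, // rwx bits plus setuid/setgid/sticky
    isDirectory: (mode & S_IFMT) === S_IFDIR,
    isSymlink: (mode & S_IFMT) === S_IFLNK,
  };
}

// A regular file with mode 0755; `>>> 0` keeps the shifted value unsigned.
const attrs = (0o100755 << 16) >>> 0;
const { permissions } = unixModeFromExternalAttributes(attrs);
console.log(permissions.toString(8)); // prints "755"
```

A library which exposes the raw attribute value leaves a consumer free to do this themselves and pass the result to `fs.chmod` or the `mode` option when creating the file.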

Knows When It’s Finished

In other words, assuming there have been no errors, and the package itself is creating the output files, it ought to be capable of emitting an event / resolving a promise at such a time as the unzipped files have been created.
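For what "finished" could mean in code, here is a minimal sketch using only Node's standard library. The `entries` array is a hypothetical stand-in for whatever a ZIP package hands back (a name, a mode and a readable stream per entry), and `writeEntries` is my own name, not any package's API. It does no validation of entry names, which a real implementation would need.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { Readable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: resolves once every entry is fully written to disk.
async function writeEntries(entries, outputDir) {
  await Promise.all(entries.map(async ({ fileName, mode, stream }) => {
    const target = path.join(outputDir, fileName);
    await fs.promises.mkdir(path.dirname(target), { recursive: true });
    // pipeline() resolves only after the write stream has finished,
    // and rejects if either side errors. The mode is subject to the umask.
    await pipeline(stream, fs.createWriteStream(target, { mode }));
  }));
}

// Usage with in-memory stand-ins for decompressed entry streams.
const outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'unzip-sketch-'));
const done = writeEntries([
  { fileName: 'bin/run.sh', mode: 0o755, stream: Readable.from([Buffer.from('#!/bin/sh\n')]) },
  { fileName: 'README', mode: 0o644, stream: Readable.from([Buffer.from('hello\n')]) },
], outputDir);
done.then(() => console.log('all entries written to', outputDir));
```

The point is only that completion is tied to the write side finishing, not to the last byte having been read from the archive.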

Shootout, By Popularity

adm-zip

> 3,700,000 downloads in the last month

  • Buffering: Reads entire file into memory
  • Permissions: Can’t restore
  • Completion: Synchronous

Score: one zip out of three

yauzl

> 3,600,000 downloads in the last month

  • Buffering: Doesn’t stream, reads from Central Directory
  • Permissions: Made available via entry items
  • Completion: Delegates writing to consumer

Score: three zips out of three

decompress

> 1,100,000 downloads in the last month

  • Buffering: Stores entire decompressed output in memory!
  • Permissions: Restores.
  • Completion: Appears correct.

Score: two zips out of three

unzip

> 250,000 downloads in the last month

Failed immediately on the valid zip file I was testing with: invalid signature: 0x80014. The software seems to have been abandoned a couple of years ago, but remains popular.

Score: zero zips!

unzip2

~40,000 downloads in the last month

  • Buffering: Streaming API. As it doesn’t buffer, it must read metadata from file-local headers rather than the Central Directory suffix.
  • Permissions: Can’t restore.
  • Completion: Appears correct.

Score: two zips out of three

unzipper

> 13,500 downloads in the last month

  • Buffering: Streaming API, per unzip2
  • Permissions: Can’t restore
  • Completion: unzip.Extract never emits a completion event

Score: one zip out of three

Conclusion

It’s no coincidence that the one library above (yauzl) which scored 3/3 on this benchmark (arbitrary, infallible) does so by forcing the caller to write the files, and explicitly avoids read streams. Unzipping & restoring an archive file is a compound operation, with a bunch of interactions - not well suited to a single function of two arguments.

I think there are a couple of reasons why this happens:

  • Libraries which appear inordinately convenient for a subset of use-cases will achieve some degree of popularity.
  • Asynchronous coordination in a syntactically rigid language is difficult - the fewer operations, the easier this is.