I would like to share a Rust implementation of the Zstandard seekable format I've been working on.
Regular zstd compressed files consist of a single frame, meaning you have to start decompression at the beginning. The seekable format splits compressed data into a series of independent frames, each compressed individually, so that decompression of a section in the middle of an archive only requires zstd to decompress at most a frame's worth of extra data, instead of the entire archive.
I started working with the seekable format because I wanted to resume downloads of big zstd compressed files that are decompressed and written to disk on the fly. At first I created and used bindings to the C functions that are available upstream[1]. However, I stumbled over the first segfault rather quickly (it's now fixed) and found out that the functions only cover basic use cases. After looking closer at the upstream implementation, I noticed that it uses functions of the core API that are now deprecated, and it doesn't allow access to the low-level (de)compression contexts. To me it looks like a PoC/demo implementation that isn't maintained the same way as the zstd core API, which is probably also why it lives in the contrib directory.
My use-case seemed to require a complete rewrite of the seekable format, so I decided to implement it from scratch in Rust using bindings to the advanced zstd compression API, available from zstd 1.4.0.
The result is a single-dependency library crate[2] and a CLI crate[3] for the seekable format that feels similar to the regular zstd tool.
Any feedback is highly appreciated!
[1]: https://github.com/facebook/zstd/tree/dev/contrib/seekable_f...
[2]: https://crates.io/crates/zeekstd
[3]: https://github.com/rorosen/zeekstd/tree/main/cli
Has zstd actually standardized the seekable version? Last I checked (which was quite a while ago) it had not been declared a standard, so I was reluctant to write a filter for nbdkit, even though it's very much a requested feature.
Why zeek, BTW? Is it a play on "zstd" and "seek"? My employer is also the custodian of the zeek project (https://zeek.org), so I was confused for a second.
[1] https://github.com/SaveTheRbtz/zstd-seekable-format-go
Yes, the name is a combination of zstd and seek. Funnily enough, I wanted to name it just zeek first before I knew that it already exists, so I switched to zeekstd. You're not the first person asking me if there is any relation to zeek and I understand how that is misleading. In hindsight the name is a little unfortunate.
[1] https://www.htslib.org/doc/bgzip.html
The spec does mention:
> While only Checksum_Flag currently exists, there are 7 other bits in this field that can be used for future changes to the format, for example the addition of inline dictionaries.
so I don't think seekable zstd supports these dictionaries just yet.
With multiple inline dictionaries, one could detect when new chunks compress badly with the previous dictionary and train new ones on the fly. Could be useful for compressing formats with headers and mixed data (e.g. game files, which can contain a mix of text + audio + video, or just regular old .tar files I suppose).
https://github.com/facebook/zstd?tab=readme-ov-file#the-case...
Gzip can also have multiple “frames” concatenated together and be seamlessly decompressed. Is this basically the same concept? As mentioned by others, bgzip uses this feature of gzip to great effect and is the standard compression in bioinformatics because of it (and is sadly hard coded to limit other potentially useful Gzip extensions).
My interest is to see if using zstd instead of gzip as a basis of a format would be beneficial. I expect for there to be better compression, but I’m skeptical if it would be enough to make it worthwhile.
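For anyone who wants to check the gzip behaviour mentioned above, here is a minimal sketch using only Python's standard library. The helper name `concat_members` is made up for illustration. It compresses each chunk as its own gzip member, concatenates the members, and lets a stock decoder read them back as one stream.

```python
import gzip


def concat_members(chunks):
    """Compress each chunk as an independent gzip member and concatenate them."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


chunks = [b"first block\n", b"second block\n", b"third block\n"]
blob = concat_members(chunks)

# A standard gzip decoder walks through all members and returns their
# decompressed contents back to back.
restored = gzip.decompress(blob)
```

Each member can also be decompressed on its own if you know where it starts, which is the property bgzip and seekable zstd both build on.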
"Seekable Zstd" is basically just a multi-frame Zstd stream, with the addition of a "seek table" at the end of the file which contains the compressed and uncompressed sizes of each frame. The seek table itself is marked as a skippable frame, so that seekable Zstd is backward-compatible with normal Zstd decompressors (the seek table is just treated as metadata and ignored).
https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...
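To make the layout a bit more concrete, here is a rough Python sketch of building and parsing such a seek table. The field order and the two magic numbers are taken from my reading of the upstream spec linked above, so please verify them against that document. The function names are hypothetical and not part of zeekstd or zstd, and the sketch ignores the optional per-frame checksum values beyond skipping over them.

```python
import struct

# Values as I read them in the upstream seekable format spec (assumption:
# double-check against the linked document before relying on them).
SKIPPABLE_MAGIC = 0x184D2A5E
SEEKABLE_MAGIC = 0x8F92EAB1
HEADER_SIZE = 8   # skippable magic + frame size
FOOTER_SIZE = 9   # number of frames + descriptor byte + seekable magic


def build_seek_table(frames):
    """Serialize (compressed_size, decompressed_size) pairs, without checksums."""
    entries = b"".join(struct.pack("<II", c, d) for c, d in frames)
    footer = struct.pack("<IBI", len(frames), 0, SEEKABLE_MAGIC)
    body = entries + footer
    return struct.pack("<II", SKIPPABLE_MAGIC, len(body)) + body


def parse_seek_table(data):
    """Parse the seek table sitting at the very end of `data`."""
    num_frames, descriptor, magic = struct.unpack("<IBI", data[-FOOTER_SIZE:])
    if magic != SEEKABLE_MAGIC:
        raise ValueError("no seekable footer found")
    # The top bit of the descriptor is the checksum flag: entries grow by 4 bytes.
    entry_size = 12 if descriptor & 0x80 else 8
    table_size = HEADER_SIZE + num_frames * entry_size + FOOTER_SIZE
    table = data[-table_size:]
    skip_magic, frame_size = struct.unpack("<II", table[:HEADER_SIZE])
    if skip_magic != SKIPPABLE_MAGIC or frame_size != table_size - HEADER_SIZE:
        raise ValueError("malformed seek table header")
    frames = []
    for i in range(num_frames):
        start = HEADER_SIZE + i * entry_size
        frames.append(struct.unpack("<II", table[start:start + 8]))
    return frames


# Three made-up frames, preceded by 270 placeholder bytes standing in for
# the compressed frames themselves.
frames = [(100, 1000), (120, 1000), (50, 300)]
data = b"\x00" * 270 + build_seek_table(frames)
parsed = parse_seek_table(data)
```

Reading the footer first is what lets a decoder find the table without scanning the file: the frame count tells it how far back the skippable frame header is.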
The way that’s handled in the bgzip/gzip world is with an external index file (.gzi) with compressed/uncompressed offsets. The index could be auto-computed, but would still require reading the header for each frame.
I vastly prefer the idea of having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame, so that would break naive decompressors. I’m still not sure the file size savings would be big enough to switch over to zstd, but I like the approach.
Looking at the file format RFC (https://www.ietf.org/rfc/rfc1952.txt), the compressed frames are called "members" and each member's header has some optional fields: "extra", "name", and "comment".
The comment is meant to be displayed to users (and shouldn't affect compression) so assuming common decoder software is at least able to properly skip over it, it seems like you could put the index data there.
One way to do it would be to compress everything except the last byte of the input data, then create a separate member just for that last byte. That way you can look at the end of the file and pretty easily find the header because the compressed data that follows it will be very tiny.
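A quick way to test the comment-field idea is to hand-assemble a gzip member with the FCOMMENT flag set and see whether a stock decoder skips it. The sketch below does that with Python's standard library. The function name and the index string are invented for illustration. Note that RFC 1952 defines the comment as a zero-terminated string, so binary index data would need an encoding that avoids zero bytes.

```python
import gzip
import struct
import zlib

FCOMMENT = 0x10  # RFC 1952 flag bit: a zero-terminated comment follows the header


def gzip_member_with_comment(payload: bytes, comment: bytes) -> bytes:
    """Build one gzip member whose header carries `comment`."""
    if b"\x00" in comment:
        raise ValueError("comment must not contain zero bytes")
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FCOMMENT, 0, 0, 255)
    deflater = zlib.compressobj(wbits=-15)  # raw deflate, no zlib wrapper
    body = deflater.compress(payload) + deflater.flush()
    # Trailer: CRC32 and length (mod 2^32) of the uncompressed data
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + comment + b"\x00" + body + trailer


payload = b"hello seekable world\n"
index = b"0:0;1:4096"  # hypothetical index encoding, purely illustrative
member = gzip_member_with_comment(payload, index)

# Python's decoder ignores the comment and returns only the payload.
decoded = gzip.decompress(member)
```

This only shows that one decoder tolerates the field. Whether every gzip reader in the wild does the same would still need checking.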
One issue with bgzip in particular is that it fixes the gzip header fields allowed, so you can only have one extra value (which is the size of the current block). Because of this, you can’t have new fields in the header for bgzip (the gzip flavor widely used in bioinformatics). One thing I wanted to do was to add a header field for sha1/sha256/etc. of the current block. When you have files of sufficient size, it can be helpful to have chunk-level signatures to protect against bitrot. This is just one use case for novel header elements (which is somewhat alleviated as gzip blocks all have their own crc32, but that’s just one idea).
I will add a section to the readme, this is a good question that other people might have too!
This is basically the only function I use from zstd_seekable, so it would be nice to have that in zeekstd as well.
The decompress function in zstd-seekable starts decompression at the beginning of the frame to which the offset belongs and discards data until the offset is reached. It also just stops decompression at the specified offset. Zeekstd uses complete frames as the smallest possible decompression unit, as only the checksum data of a complete frame can be verified.
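As a rough illustration of the lookup step described above, here is a small Python sketch that maps an uncompressed offset to the frame containing it, given the per-frame sizes from a seek table. The function name `locate` and the sample sizes are made up. This is not how zstd-seekable or zeekstd is actually implemented, just the general idea.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(frames, offset):
    """Map an uncompressed offset to (frame_index, compressed_start, skip).

    `frames` is a list of (compressed_size, decompressed_size) pairs.
    `skip` is the number of decompressed bytes to discard from the start of
    the frame before the requested offset is reached.
    """
    if not frames:
        raise ValueError("empty seek table")
    # Uncompressed end offset of every frame, e.g. [1000, 2000, 2300]
    ends = list(accumulate(d for _, d in frames))
    if not 0 <= offset < ends[-1]:
        raise ValueError("offset outside of the uncompressed data")
    index = bisect_right(ends, offset)
    compressed_start = sum(c for c, _ in frames[:index])
    frame_start = ends[index - 1] if index else 0
    return index, compressed_start, offset - frame_start


frames = [(100, 1000), (120, 1000), (50, 300)]
result = locate(frames, 1500)
```

With these sizes, offset 1500 falls into the second frame, which starts at compressed byte 100, and 500 decompressed bytes have to be thrown away. A frame-granular decoder would instead hand back the whole frame.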
Given existing libraries, it should be really simple to create an SQLite VFS for my Go driver that reads (not writes) compressed databases transparently, but tool support was kinda lacking.
Will the zstd CLI ever support it? https://github.com/facebook/zstd/issues/2121
1. Huge compression window (like 100+MB, so "chunking" won't work)
2. Random seeking into compressed payload
Anyone know of any projects that can provide both of these at once?
[1] https://github.com/rorosen/zeekstd/blob/main/cli/src/compres...
Explanation here https://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/
For example say you want to seek to 10MB into the uncompressed file. Do you need to store metadata separately to know how many frames to skip?
https://gitlab.com/nbdkit/nbdkit/-/blob/master/filters/xz/xz...
And then kinda learned about criu and I think criu can technically do it but IDK, I in fact started to try to create the zip project in golang but failed it over... Pretty nice to know that zstd exists
It's not a zip file, but technically it's compressed, and I guess you can technically still encode the data in such a way that it's essentially zip in some sense...
This is why I come on hackernews.
I also wrote a tool to make a randomly modifiable gzipped disk image: https://rwmj.wordpress.com/2022/12/01/creating-a-modifiable-...
Piping a seekable file for decompression via stdin isn't possible, unfortunately. Decompressing a seekable file requires reading the seek table first (which is usually at the end of the file) and then seeking to the desired frame position, so zeekstd needs to be able to seek within the file.
If you want to decompress the complete file, you can use the regular zstd tool: "cat seekable.zst | zstd -d"