EdSchouten
I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm, FastCDC (also used by this project), can be improved significantly by looking ahead. An implementation of that can be found here:

https://github.com/buildbarn/go-cdc
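
To give a rough idea of what "looking ahead" means: a plain gear-hash chunker cuts at the first position where the rolling hash matches a mask, while a lookahead variant keeps scanning a bit further and prefers a "stronger" candidate cut point. The Go sketch below only illustrates that general idea; the seed, size limits, lookahead window, and scoring rule are made-up example parameters, not what go-cdc actually does:

    // Sketch of a gear-hash chunker with a naive lookahead step.
    // Illustrative only: the constants and the scoring rule below are made
    // up for this example and do not match what buildbarn/go-cdc does.
    package main

    import (
        "fmt"
        "math/bits"
        "math/rand/v2"
    )

    // 256 random 64-bit constants, one per byte value. The seed is fixed so
    // that every producer derives the same table.
    var gearTable = func() (t [256]uint64) {
        rng := rand.New(rand.NewPCG(0x1234, 0x5678)) // arbitrary fixed seed
        for i := range t {
            t[i] = rng.Uint64()
        }
        return
    }()

    const (
        minSize = 2 << 10                 // 2 KiB minimum chunk size
        maxSize = 64 << 10                // 64 KiB maximum chunk size
        mask    = ^uint64(0) << (64 - 13) // top 13 bits zero -> ~8 KiB average
        window  = 4 << 10                 // how far past the first cut we keep looking
    )

    // nextBoundary returns the length of the next chunk of data. A plain gear
    // chunker would cut at the first position where h&mask == 0; here we keep
    // scanning up to `window` bytes past that first candidate and prefer the
    // candidate whose hash has the most leading zero bits.
    func nextBoundary(data []byte) int {
        if len(data) <= minSize {
            return len(data)
        }
        limit := min(len(data), maxSize)

        var h uint64
        first, best, bestScore := -1, -1, -1
        for i := 0; i < limit; i++ {
            h = (h << 1) + gearTable[data[i]]
            if i+1 < minSize {
                continue
            }
            if h&mask == 0 {
                if first < 0 {
                    first = i + 1
                }
                if score := bits.LeadingZeros64(h); score > bestScore {
                    best, bestScore = i+1, score
                }
            }
            if first >= 0 && i+1 >= first+window {
                break // looked far enough past the first candidate
            }
        }
        if best >= 0 {
            return best
        }
        return limit // no cut point found before maxSize
    }

    func main() {
        // Chunk 1 MiB of deterministic pseudo-random data and print the cuts.
        data := make([]byte, 1<<20)
        rng := rand.New(rand.NewPCG(1, 2))
        for i := range data {
            data[i] = byte(rng.Uint64())
        }
        for off := 0; off < len(data); {
            n := nextBoundary(data[off:])
            fmt.Printf("chunk at %7d, length %6d\n", off, n)
            off += n
        }
    }

Choosing among several candidate cut points, instead of always taking the first one, is where the extra deduplication comes from.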


Scaevolus
This lookahead is very similar to the "lazy matching" used in Lempel-Ziv compressors! https://fastcompression.blogspot.com/2010/12/parsing-level-1...

Did you compare it to Buzhash? I assume gearhash is faster, given its simpler per-iteration structure. (Also, rand/v2's seeded generators might be a better fit than MT19937 for initializing the gear table.)

EdSchouten OP
Yeah, GEAR hashing is simple enough that I haven't considered using anything else.

Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2 KB of the binary expansion of π) would work just as well.
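
For illustration, this is roughly what generating that table with one of the stdlib's seeded generators looks like; the generator and seed below are arbitrary example choices, the only real requirement being that every producer derives exactly the same 256 constants:

    // Sketch of seeding the gear table with math/rand/v2 instead of MT19937.
    // The generator and seed below are arbitrary example choices; all that
    // matters is that every producer ends up with the same 2 KB of constants.
    package main

    import (
        "fmt"
        "math/rand/v2"
    )

    func gearTable() [256]uint64 {
        var seed [32]byte
        copy(seed[:], "example seed, not a real one!") // placeholder seed
        rng := rand.New(rand.NewChaCha8(seed))

        var t [256]uint64 // 256 * 8 bytes = 2 KB of constants
        for i := range t {
            t[i] = rng.Uint64()
        }
        return t
    }

    func main() {
        t := gearTable()
        fmt.Printf("gear[0x00] = 0x%016x\n", t[0])
        fmt.Printf("gear[0xff] = 0x%016x\n", t[255])
    }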

pbhjpbhj
A random number generator could, in principle, spit out the binary digits of π, so if it works with _any_ random numbers...

If it doesn't work with just any random numbers, then some seeds work better than others, and intuitively you could then search for a best seed (or set of best seeds).

Scaevolus
Right, just one fewer module dependency using the stdlib RNG.

rokkamokka
What would you estimate the performance implications to be of using go-cdc instead of FastCDC in their cdc_rsync?

EdSchouten OP
In my case I observed a ~2% reduction in data storage when attempting to store and deduplicate various versions of the Linux kernel source tree (see link above). But that also includes the space needed to store the original version.

If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.

quotemstr
I wonder whether there's a role for AI here.

(Please don't hurt me.)

AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).

Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.

EdSchouten OP
Yeah, that's true. Having a chunking algorithm that's aware of the content/file format could make it work even better. For example, it makes a lot of sense to chunk source files at function/scope boundaries.

In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on algorithms that are not only easy to implement, but also easy to implement consistently. I think the MaxCDC implementation I shared strikes a good balance in that regard.

xyzzy_plugh
> https://bonanza.build

I just wanted to let you know, this is really cool. Makes me wish I still used Bazel.
