Comment by gopalv - Hacker Neue

gopalv Oct 2, 2025 parent

> Is the big win that you can send around custom little vector embedding databases with a built in sandbox?

No, this is a compatibility layer for future encoding changes.

For example, ORCv2 has never shipped because we tried to bundle all the new features into a new format version, ship all the writers with the features disabled, then ship all the readers with support and then finally flip the writers to write the new format.

Specifically, there was a new flipped bit version of float encoding which sent the exponent, mantissa and sign as integers for maximum compression - this would've been so much easier to ship if I could ship a wasm shim with the new file and skip the year+ wait for all readers to support it.

We'd have made progress with the format, but we'd also be able to deprecate a reader impl in code without losing compatibility if the older files carried their own information.

Today, something like Spark's variant type would benefit from this - the sub-columnarization that does would be so much easier to ship as bytecode instead of as an interpreter that contains support for all possible recombinations from split up columns.

PS: having spent a lot of nights tweaking tpc-h with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold up those bits in the benchmark

RyanHamilton Oct 2, 2025

I'm less confident. Your description highlights a real problem but this particular solution looks like an attempt to shoe horn a technical solution to a political people problem. It feels like one of these great ideas that years later results in 1000s of different decoders, breakages and a nightmare to maintain. Then someone starts an initiative to move decoding from being bundled and to instead just defining the data format.

Sometimes the best option is to do the hard political work and improve the standard and get everyone moving with it. People have pushed parquet and arrow. Which they are absolutely great technologies that I use regularly but 8 years after someone asked how to write parquet in java, the best answer is to use duckdb: https://stackoverflow.com/questions/47355038/how-to-generate...

Not having a good parquet writer for java shows a poor attempt at pushing forward a standard. Similarly arrow has problems in java land. If they can't be bothered to consider how to actually implement and roll out standards to a top 5 language, I'm not sure I want them throwing WASM into the mix will fix it.

This item has no comments currently.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous