One thing I'd love to see is some form of row-group-level metadata statistics for embeddings within a parquet file: something that would allow readers to push predicates down to the level of individual HTTP requests and avoid loading non-relevant rows into the database at all when querying a remote file, particularly one stored on S3-compatible storage that supports byte-range requests. I'm not sure what the implementation would look like: how you'd define the sorting or clustering that organizes the "close" rows together, how the metadata would be calculated, or what the reader side would look like. But I'd love to be able to apply some of the same patterns to vector search that geoparquet uses (rough sketch below the link).
https://github.com/jasonjmcghee/portable-hnsw
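To make the shape of it concrete, here's a rough sketch in Python with pyarrow + scikit-learn. The names (write_clustered, query_closest, the "vector.centroids" metadata key) are all made up, and k-means is just a stand-in for whatever clustering/sorting scheme you'd actually want: cluster the rows by embedding, write one row group per cluster, stash the centroids in the parquet footer's key-value metadata, and have the reader rank row groups by centroid distance and only read the closest ones.

    import json

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq
    from sklearn.cluster import KMeans

    def write_clustered(path, ids, embeddings, n_clusters=16):
        # Cluster rows so "close" embeddings land in the same row group,
        # then record each cluster's centroid in the footer metadata.
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
        meta = {b"vector.centroids": json.dumps(km.cluster_centers_.tolist()).encode()}
        schema = pa.schema(
            [("id", pa.int64()), ("embedding", pa.list_(pa.float32()))],
            metadata=meta,
        )
        with pq.ParquetWriter(path, schema) as writer:
            for c in range(n_clusters):
                mask = km.labels_ == c
                table = pa.table(
                    {
                        "id": np.asarray(ids)[mask],
                        "embedding": [row.tolist() for row in embeddings[mask]],
                    },
                    schema=schema,
                )
                writer.write_table(table)  # each call here becomes its own row group

    def query_closest(path, query_vec, top_clusters=2):
        # Footer-only read: rank row groups by centroid distance, then fetch
        # just the closest ones instead of the whole file.
        f = pq.ParquetFile(path)
        centroids = np.array(json.loads(f.schema_arrow.metadata[b"vector.centroids"]))
        dists = np.linalg.norm(centroids - np.asarray(query_vec, dtype=np.float32), axis=1)
        keep = np.argsort(dists)[:top_clusters]  # row group i == cluster i by construction
        return f.read_row_groups([int(i) for i in sorted(keep)])

Pointed at S3-compatible storage through s3fs or pyarrow's own filesystem layer, that read_row_groups call turns into byte-range requests for just those groups, which is basically the geoparquet trick applied to vectors.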
That would open up efficient query patterns over larger datasets for RAG projects where you may not have the resources to run an expensive vector database.
As others have mentioned in other threads, parquet isn't a great tool for the job here, but you could theoretically build a different file format that lends itself better to representing a vector database as static files (rough sketch of what I mean below).
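A purely hypothetical layout could be as simple as a small fixed-size JSON header followed by fixed-width float32 vectors, so any row's bytes sit at a computable offset and can be pulled with a plain HTTP Range request. Everything below (the names, the 256-byte header) is invented for illustration:

    import json

    import numpy as np
    import requests

    HEADER_LEN = 256  # fixed-size, zero-padded JSON header (arbitrary choice)

    def read_header(url):
        # The header tells the reader how wide each row is,
        # without downloading anything else.
        resp = requests.get(url, headers={"Range": f"bytes=0-{HEADER_LEN - 1}"})
        return json.loads(resp.content.rstrip(b"\x00"))

    def fetch_rows(url, row_ids):
        # Pull only the requested vectors from the remote file, one Range
        # request per row (you'd batch contiguous rows in practice).
        dim = read_header(url)["dim"]
        row_size = dim * 4  # float32
        out = {}
        for r in row_ids:
            start = HEADER_LEN + r * row_size
            resp = requests.get(
                url, headers={"Range": f"bytes={start}-{start + row_size - 1}"})
            out[r] = np.frombuffer(resp.content, dtype=np.float32, count=dim)
        return out

The same trick extends to index structures: keep the graph or centroid table at known offsets and a reader can walk it with range requests alone.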
https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...