As a human user, consider file-system navigation or code search. You navigate to a seed file, then hop through the children and dependencies until you find what you're looking for. The value is in the explicit, direct edges. Perfect search is hard; landing in the right neighborhood is less so. (It's like golf, if you think about it.)
Agentic systems loop through the following steps (Plan -> Inquire -> Retrieve -> Observe -> Act -> Repeat). The agent interacts with your search system during the inquire and retrieve phases. In those phases, there are two semantic problems that a simple embedding-based search or a plain database alone can't solve: seeding and completeness. Seeding: how do you ask a good question when you don't know what you don't know? Completeness: once you know a little bit, how do you know you've obtained everything you need to answer the question?
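To make the loop concrete, here's a minimal Python sketch of that cycle. Every callable here (`plan`, `retrieve`, `observe`, `act`) is a hypothetical stand-in for whatever your agent framework actually provides; the point is only where the search system gets hit.

```python
# Minimal sketch of the agentic loop. plan/retrieve/observe/act are
# hypothetical placeholders for your own planner, retriever, and tools.
def agent_loop(task, plan, retrieve, observe, act, max_steps=10):
    context = []
    for _ in range(max_steps):
        query = plan(task, context)           # Plan -> Inquire: decide what to ask next
        results = retrieve(query)             # Retrieve: this is where your search system lives
        findings = observe(results, context)  # Observe: keep what looks useful
        context.extend(findings)
        done, answer = act(task, context)     # Act: answer, edit, or keep looping
        if done:
            return answer
    return None  # gave up: seeding or completeness failed somewhere
```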
A solid embedding-based search allows under-defined, free-form inquiry and puts the agent near the data it's looking for. From there, an explicit graph lets the agent walk the edges until it hits gold, or gives it enough signal to retry with a better-informed free-form query. Together, they solve the seeding problem. Once you have a few seed nodes to work from, the agent can keep exploring neighbors until they become sufficiently irrelevant. At that threshold, the retrieval system can return the explored nodes along with a measurable confidence in completeness. That turns completeness into a metric you can optimize, which addresses the second problem.
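Here's a rough sketch of that two-stage retrieval, assuming you already have an embedding index, an adjacency lookup, and a relevance scorer; all of those callables are hypothetical, and the relevance threshold is exactly the hand-tuned part discussed next.

```python
import heapq

def graph_rag_retrieve(query_vec, embed_search, neighbors, relevance,
                       min_rel=0.35, k_seeds=5):
    """Seed via embedding search, then expand explicit edges best-first until
    the next candidate is no longer relevant enough. Hypothetical interfaces:
    embed_search(vec, k) -> [(node_id, score)], neighbors(node_id) -> node_ids,
    relevance(vec, node_id) -> float in [0, 1]."""
    # Stage 1: seeding -- land in the right neighborhood.
    frontier = [(-s, n) for n, s in embed_search(query_vec, k=k_seeds)]
    heapq.heapify(frontier)

    visited, results = set(), []
    best_discarded = 0.0
    while frontier:
        neg_score, node = heapq.heappop(frontier)
        score = -neg_score
        if node in visited:
            continue
        visited.add(node)
        if score < min_rel:
            # The frontier is a max-heap, so everything left is even less relevant.
            best_discarded = score
            break
        results.append((node, score))
        # Stage 2: walk explicit edges out of every node we kept.
        for nbr in neighbors(node):
            if nbr not in visited:
                heapq.heappush(frontier, (-relevance(query_vec, nbr), nbr))

    # Crude completeness confidence: how far below the cut-off the most
    # promising unexplored neighbor sits. 1.0 means nothing promising was left.
    completeness = 1.0 - (best_discarded / min_rel)
    return results, completeness
```

The returned completeness score is one possible metric, not a standard one; the useful part is that it's something you can measure and tune against.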
You'll notice there is no magic here. The quality of your search will depend on the quality of your edges, entities, exploration strategy, and relevance detectors. That requires a ton of hand-engineering and subject-specific domain knowledge, neither of which is a systems bottleneck. The datastore itself will do very little to get you a better answer.
Which brings me to your question: the datastore. The datastore only matters at sufficient scale. You CAN implement Graph RAG in a standard database. Add a table or column for your edges, another for entities, and some way to search over embeddings, and you're good. You can get it done in an afternoon (until permissions become an issue, but I digress).
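For the afternoon version, a minimal sketch of what that could look like in Python with psycopg2, assuming Postgres with the pgvector extension; the table names, column names, and embedding dimension are all illustrative, not a prescription.

```python
import psycopg2  # assumes a Postgres instance with the pgvector extension available

# Illustrative schema: one table of entities with embeddings, one table of edges.
SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS entities (
    id        BIGSERIAL PRIMARY KEY,
    name      TEXT NOT NULL,
    body      TEXT,
    embedding vector(384)          -- dimension depends on your embedding model
);

CREATE TABLE IF NOT EXISTS edges (
    src      BIGINT REFERENCES entities(id),
    dst      BIGINT REFERENCES entities(id),
    relation TEXT,
    PRIMARY KEY (src, dst, relation)
);
"""

def seed_nodes(conn, query_embedding, k=5):
    """Embedding search for seeds: nearest entities by cosine distance."""
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name, 1 - (embedding <=> %s::vector) AS score "
            "FROM entities ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, vec, k),
        )
        return cur.fetchall()

def neighbors(conn, node_id):
    """Graph hop: follow explicit edges out of a node."""
    with conn.cursor() as cur:
        cur.execute("SELECT dst, relation FROM edges WHERE src = %s", (node_id,))
        return cur.fetchall()
```

Plug `seed_nodes` and `neighbors` into the retrieval sketch above and you have the whole afternoon project, minus the permissions headache.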
We know that a Spotlight-style file-system search works just fine on 100k+ documents while your Mac's fan barely even turns on. If you're asking this question, your company's corpus probably doesn't scale past that point. In fact, I'd argue that few companies will ever cross that threshold for agentic operations. At this scale, your Postgres instance won't be the bottleneck.
Comparing Postgres to graph-RAG startups, the real value of a native graph-RAG solution is the defaults. Those companies know that their users' need is agentic semantic search, and the products come preloaded with embeddings, entities, and graph edges that aren't completely useless. From a practical standpoint, those defaults might be what tips the decision in their favor. But be aware that your performance gains come from outsourcing the hand-engineering of features, not from the data structure itself.
My personal opinion is to keep the data structure as simple as possible. MLEs and data scientists are mediocre systems engineers, and it's okay to accept that. You want your ML and product team to be able to iterate on search logic and quality as fast as possible; that's where the real gains will come from. Speaking from experience, premature optimization in a new field will slow your team down to a crawl. I.e., go with Postgres if that's what's simplest for everyone to work with.
tl;dr: It's not about the scalability of the data structure; it's about how you use it.