We documented our approach in an HN post (https://www.hackerneue.com/item?id=37824547) a couple weeks ago. Today, we are open sourcing the framework we have developed.
The framework focuses on RAG data pipelines and provides scale, reliability, and data synchronization capabilities out of the box.
For those newer to RAG, it is a technique for providing context to Large Language Models. It consists of grabbing pieces of information (e.g. excerpts from news articles, papers, descriptions, etc.) and incorporating them into prompts to help contextualize the responses. The technique goes one level deeper in finding the right pieces of information to incorporate: the search for relevant information is done with vector embeddings and vector databases.
Those pieces of news articles, papers, etc. are transformed into vector embeddings that represent the semantic meaning of the information. These vector representations are organized into indexes where we can quickly search for the pieces of information that most closely resemble (from a semantic perspective) a given question or query. For example, if I take news articles from this year, vectorize them, and add them to an index, I can quickly search for pieces of information about the US elections.
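To make that concrete, here is a minimal sketch of the embed-and-search step, assuming the sentence-transformers package and an open embedding model (any embedding service and vector store slot into the same flow):

```python
# Minimal sketch of embed-and-search, assuming the sentence-transformers package.
import numpy as np
from sentence_transformers import SentenceTransformer

articles = [
    "The US elections saw record turnout this year.",
    "A new transformer architecture was published at NeurIPS.",
    "Local elections in Texas were decided by a narrow margin.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# Normalized embeddings let us use a dot product as cosine similarity.
index = model.encode(articles, normalize_embeddings=True)

query = model.encode(["What happened in the US elections?"], normalize_embeddings=True)
scores = (index @ query.T).ravel()          # cosine similarity against every article
for i in np.argsort(-scores)[:2]:           # top-2 semantically closest pieces
    print(f"{scores[i]:.3f}  {articles[i]}")
```

A production setup swaps the in-memory matrix for a vector database, but the shape of the problem is the same.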
To help achieve this, the Neum AI framework features:
Starting with built-in data connectors for common data sources, embedding services and vector stores, the framework provides modularity to build data pipelines to your specification.
The connectors support pre-processing capabilities to define loading, chunking, and selection strategies that optimize the content to be embedded. This also includes extracting metadata that will be associated with a given vector (a rough end-to-end sketch of these stages follows the feature list).
The generated pipelines support large-scale jobs through a high-throughput distributed architecture. The connectors allow you to parallelize tasks like downloading documents, processing them, generating embeddings, and ingesting data into the vector DB.
For data sources that might be continuously changing, the framework supports data scheduling and synchronization. This includes delta syncs where only new data is pulled.
Once data is loaded into a vector database, the framework supports querying it, including hybrid search using the metadata added during pre-processing. As part of the querying process, the framework provides capabilities to capture feedback on retrieved data as well as to run evaluations against different pipeline configurations.
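As a rough end-to-end illustration of those stages (loading, chunking, metadata extraction, embedding, and writing to a sink), here is a hypothetical sketch; the function names and the dict-based store are illustrative stand-ins, not Neum's actual API:

```python
# Illustrative ingestion pipeline: load -> chunk -> attach metadata -> embed -> upsert.
# The names and the dict-based store are hypothetical stand-ins, not Neum's actual API.
from dataclasses import dataclass, field
from sentence_transformers import SentenceTransformer

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_document(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Character chunking with overlap; metadata records provenance for hybrid search later."""
    chunks = []
    for start in range(0, len(text), size - overlap):
        chunks.append(Chunk(text[start:start + size], {"doc_id": doc_id, "offset": start}))
    return chunks

def ingest(docs: dict[str, str], store: dict) -> None:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    for doc_id, text in docs.items():
        chunks = chunk_document(doc_id, text)
        vectors = model.encode([c.text for c in chunks], normalize_embeddings=True)
        for chunk, vector in zip(chunks, vectors):
            # A real sink would be Pinecone/Weaviate/Qdrant/etc.; a dict stands in here.
            key = f"{doc_id}:{chunk.metadata['offset']}"
            store[key] = {"vector": vector, "text": chunk.text, "metadata": chunk.metadata}

store: dict = {}
ingest({"article-1": "The US elections saw record turnout this year. " * 30}, store)
print(len(store), "vectors upserted")
```

The framework's aim, as described above, is to make each of those steps configurable, distributed, and re-runnable at scale.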
Try it out, and if you're interested in chatting more about this, shoot us an email at founders@tryneum.com.
I assume that in the not-so-distant future, a malware scanner will detect this and prevent you from running it locally.
Generally we have found that recursive chunking and character chunking tend to be short-sighted.
Why not capture a few strategies that the LLM returns as code that can be properly audited (and run locally, improving the overall performance)?
The idea of auditing the strategy is interesting. The flow we have used for the semantic chunkers to date has been along these lines: 1) use the utility to generate the code snippets (and do some manual inspection), 2) test the code snippets against some sample text, 3) validate the results.
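For steps 2 and 3, a simple harness along these lines can run a candidate (LLM-generated) chunker over sample text and flag obvious problems before it goes anywhere near a pipeline. This is a sketch with made-up thresholds, not the actual utility:

```python
# Hypothetical harness for steps 2 and 3: run a candidate (LLM-generated) chunker
# over sample text and validate the results before trusting it.
from typing import Callable

def validate_chunker(chunker: Callable[[str], list[str]], sample: str,
                     max_len: int = 1200, min_len: int = 20) -> list[str]:
    chunks = chunker(sample)
    problems = []
    if not chunks:
        problems.append("no chunks produced")
    # Nothing should be silently dropped (whitespace aside).
    if sum(len(c) for c in chunks) < len(sample.replace(" ", "")) * 0.9:
        problems.append("chunks cover less than ~90% of the source text")
    for i, c in enumerate(chunks):
        if len(c) > max_len:
            problems.append(f"chunk {i} is too long ({len(c)} chars)")
        if len(c) < min_len:
            problems.append(f"chunk {i} is too short ({len(c)} chars)")
    return problems

# Example: a generated chunker that naively splits on blank lines.
candidate = lambda text: [p.strip() for p in text.split("\n\n") if p.strip()]
sample_text = "First paragraph about elections.\n\nSecond paragraph about turnout numbers."
print(validate_chunker(candidate, sample_text) or "looks reasonable")
```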
Yes, obviously useful for prototyping and for creating hype articles & tweets with fun examples. However, any engineer is capable of doing their own RAG with the same effort (minimal data extraction using the ancient PDF/scrape tools that are still the open SOTA, or cloud OCR for the best results -> brute-force chunking -> embed -> load into an ANN index with a complementary metadata store).
Anyone doing prod needs to know the intricacies and make advanced engineering decisions. There's a reason there aren't similar end-to-end abstractions over creating Lucene (Solr/Elastic) indexes. Hmm, why not, after many decades? …
In reality, the RAG tech is not entirely novel: it's ETL. And in practice, complex ETL is often a serious data curation effort. LLMs are the closest thing to enabling better data curation, and as long as you aren't competing with OpenAI (arguably any commercial system is), you can use ChatGPT to create your chunks.
Beyond this, embedding strategies are nice to abstract, but the best approach to embeddings is still to create your own and figure out contextual integration on your own. Creating your own can also just mean fine-tuning. Inference is often an ensemble, depending on your use case.
Probably the main point I disagree with you on is that RAG is just ETL. If that were the case, all of the AI apps people are building would be AMAZING, because we solved the ETL problem years ago. Yet app after app being released has issues like hallucinations and incorrect data. IMO, the second you insert a non-deterministic entity into the middle of an ETL pipeline, it is no longer just ETL. To try to add value here, our focus has been on adding capabilities to the framework around data synchronization (which is actually more of a vector management problem), contextualization of data through metadata, and retrieval (the part where we have spent the least time to date, but are currently spending the most).
I went through building a RAG pipeline for a company and brought up at each stage that there had been no tuning, no efficacy testing for different scenarios, no testing of different chunking strategies: just the most basic work. They released it almost immediately. Surprisingly, to not much fanfare.
It doesn't really matter
https://github.com/topics/rag
https://github.com/Dicklesworthstone/fast_vector_similarity
I've had good results from starting with cosine similarity (using FAISS) and then "enriching" the top results from that with more sophisticated measures of similarity from my library to get the final ranking.
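Roughly, that two-stage approach can look like the sketch below: FAISS for a cheap cosine first pass, then a more expensive measure over just the top candidates. The rerank() here uses rank correlation purely as a stand-in for a richer similarity measure; it is not the linked library's actual API.

```python
# Two-stage retrieval sketch: cheap cosine search with FAISS for candidates,
# then a more expensive re-ranking pass over just the top hits.
import numpy as np
import faiss
from scipy.stats import spearmanr

d, n = 128, 10_000
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
faiss.normalize_L2(corpus)

index = faiss.IndexFlatIP(d)              # inner product == cosine on normalized vectors
index.add(corpus)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
_, candidates = index.search(query, 50)   # cheap first pass: top-50 by cosine

def rerank(q: np.ndarray, doc: np.ndarray) -> float:
    # Stand-in "sophisticated" measure: rank correlation between the two vectors.
    return float(spearmanr(q, doc)[0])

scored = [(rerank(query[0], corpus[i]), int(i)) for i in candidates[0]]
print(sorted(scored, reverse=True)[:5])   # final ranking from the enrichment pass
```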
Today, it is mostly about convenience. We provide abstractions in the form of a pipeline that encompasses a data source, embed, and sink definition. This means you don't have to think about embedding your query or which class you used to add the data into the vector DB.
In the future, we have some additional abstractions coming that will add more convenience. For example, we are working on a concept of pipeline collections so that you can search across multiple indexes but get unified results. We are also adding more automation around metadata: because the pipeline configuration tells us what metadata was added (and examples of it), we can help translate queries into hybrid search. I think of it as a self-query retriever from LangChain or LlamaIndex that automatically has context of the data at hand (no need to provide attributes).
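As an illustration of what metadata-aware hybrid retrieval could look like (the names are hypothetical, not Neum's actual retriever): pre-filter on metadata the pipeline already knows about, then rank the survivors by vector similarity.

```python
# Illustrative hybrid search: metadata pre-filter first, then cosine ranking.
# Hypothetical names; not Neum's actual retriever.
import numpy as np

def hybrid_search(store: dict, query_vec: np.ndarray, filters: dict, k: int = 3):
    hits = []
    for key, item in store.items():
        # Metadata filter, e.g. {"doc_id": "article-1"} or {"source": "news"}.
        if all(item["metadata"].get(f) == v for f, v in filters.items()):
            score = float(np.dot(item["vector"], query_vec))   # vectors assumed normalized
            hits.append((score, key, item["text"]))
    return sorted(hits, reverse=True)[:k]

# With a store shaped like the ingestion sketch above:
# results = hybrid_search(store, query_vec, {"doc_id": "article-1"})
```

The automation described above would be about deriving that `filters` dict from the user's query, using the metadata the pipeline already knows it collected.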
Are there any specific retrieval capabilities you are looking for?
submissions by this user (https://news.ycombinator.com/submitted?id=picohen):
Show HN: Neum AI – Open-source large-scale RAG framework (github.com/neumtry)
Show HN: ElectionGPT – easy-to-consume information about U.S. candidates (electiongpt.ai)
Efficiently sync context for your LLM application (neum.ai)
Show HN: Neum AI – Improve your AI's accuracy with up-to-date context (neum.ai)
[1] https://www.llamaindex.ai
There are a couple areas where we think we are driving some differentiation.
1. The management of metadata as a first-class citizen. This includes capturing metadata at every stage of the pipeline.
2. Being infra-ready. We are still evolving this point, but we want to add abstractions that help developers apply this type of framework to a large-scale distributed architecture.
3. Enabling different types of data synchronization natively. So far we support both full and delta syncs, but we have work in the pipeline to bring in abstractions for real-time syncing (a rough sketch of a delta sync is below).
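For reference, a minimal sketch of how a delta sync can work: hash each document and only re-process content whose hash is new or changed since the last run. This is an illustration of the idea, not Neum's actual sync implementation.

```python
# Minimal delta-sync sketch: only re-process documents whose content hash changed.
# Illustration only; not Neum's actual sync implementation.
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("sync_state.json")

def delta_sync(docs: dict[str, str], process) -> None:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if state.get(doc_id) == digest:
            continue                      # unchanged since last sync: skip
        process(doc_id, text)             # chunk + embed + upsert only the delta
        state[doc_id] = digest
    STATE_FILE.write_text(json.dumps(state))

delta_sync({"article-1": "updated body ..."}, lambda i, t: print("re-embedding", i))
```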
Additionally, there might be a potential route where both are used, depending on the use case.
Feel free to dm if you want to chat further on this!
(Disclaimer: I am a Haystack maintainer)