- Hey! We don't support an otel collector directly - you have to connect it to some backend. A minimal one can be Jaeger, for example.
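A minimal sketch of that setup, assuming a local Jaeger all-in-one with its OTLP receiver enabled (OTLP/HTTP on port 4318, UI on 16686). OpenLLMetry emits standard OTel spans, so any OTLP-speaking backend can be wired up the same way:
```python
# Minimal sketch: point any OpenTelemetry-based SDK (OpenLLMetry included) at a
# local Jaeger instance. The endpoint assumes Jaeger all-in-one with OTLP over
# HTTP enabled on port 4318; traces then show up in the Jaeger UI on port 16686.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("llm-call"):
    pass  # your LLM call goes here
```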
- You can do it, and it's a good approach - from our experiments it can catch most errors. You don't even need to use different models - even using the same model (I don't mean asking "are you sure?" - just re-running the same workflow) will give you nice results. The only problem is that it's super expensive to run on all your traces, so I wouldn't recommend it as a monitoring tool.
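A rough, purely illustrative sketch of that re-check idea - run_workflow and answers_agree are hypothetical placeholders for your own pipeline and comparison logic (exact match, embedding similarity, another model call, etc.):
```python
# Illustrative sketch of re-running the same workflow and flagging disagreement.
# run_workflow and answers_agree are placeholders for your own pipeline and
# comparison logic.
def recheck(question: str, run_workflow, answers_agree, n_runs: int = 2) -> dict:
    answers = [run_workflow(question) for _ in range(n_runs)]
    consistent = all(answers_agree(answers[0], a) for a in answers[1:])
    return {
        "answers": answers,
        "consistent": consistent,  # inconsistent runs are worth a closer look
    }
```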
- Thanks! It can vary greatly between use cases - but we've seen extremely high detection rates for tagged texts (>95%). When switching to production, this gets trickier since you don't know what you don't know (so it's hard to tell how many "bad examples" we're missing). Our false positive rate (number of examples that were tagged as bad but weren't) has been around 2-3% out of the overall examples tagged as bad (positive) and we always work on decreasing this.
- You're right. We faced those same issues. So we plan to send prompts and completions as log events with a reference to the trace/span, instead of putting them on the span itself.
The span then only contains the most important data like the prompt template, the model that was used, token usage, etc. You can then split the metadata (spans and traces) and the large payloads (prompts + completions) into different data stores.
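A rough sketch of that split (nothing here is the actual implementation - the attribute names and the store_payload sink are placeholders):
```python
# Sketch of the split: small attributes stay on the span, big payloads are
# logged separately with the trace/span IDs so they can be joined later.
import json
import logging
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")
logger = logging.getLogger("llm.payloads")

def store_payload(record: dict) -> None:
    logger.info(json.dumps(record))  # stand-in for your payload store

def record_completion(prompt: str, completion: str, model: str, tokens: int) -> None:
    with tracer.start_as_current_span("llm.completion") as span:
        # keep only lightweight metadata on the span
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.usage.total_tokens", tokens)

        ctx = span.get_span_context()
        store_payload({
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
            "prompt": prompt,
            "completion": completion,
        })
```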
- Thanks! I wasn’t offended or anything, don’t get the wrong impression.
What strikes me as odd is the claim that an AI checking an AI is a problem. "AI" can mean a lot of things - an encoder architecture, a neural network, or a simple regression function. And at the end of the day, similar to what you said, there was a human building and fine-tuning that AI.
Anyway, this feels more like a philosophical question than an engineering one.
- It's the same logic as saying you don't want to use a computer to monitor or test your code, since that would mean a computer monitoring a computer. AI is a broad term - I agree you can use GPT (or any LLM) to grade an LLM in an accurate way, but that's not the only way you can monitor.
- I replied to you in a different thread; I don't think calling our companies "deceptive" will help either of us get anywhere. While I agree with you that detection will never be hermetic, I don't think that's the goal. By design you'll have hallucinations, and the question should be how you can monitor the rate and look for changes and anomalies.
- I think LLMs hallucinate by design. I'm not sure we'll ever get to 0% hallucinations, and we should be OK with that (at least for the coming years?). So getting an alert on a hallucination becomes less interesting. What is more interesting, perhaps, is knowing the rate at which this happens, and keeping track of whether that rate increases or decreases over time or with changes to models.
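To make that concrete, a toy sketch of tracking the rate per week and flagging drift against a baseline (the thresholds and event format are made up):
```python
# Toy sketch: compute a hallucination rate per week and flag weeks that drift
# above a chosen baseline. The tolerance and data shape are arbitrary.
from collections import defaultdict
from datetime import datetime

def weekly_rates(events: list[tuple[datetime, bool]]) -> dict[str, float]:
    """events: (timestamp, was_flagged_as_hallucination) pairs."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for ts, is_hallucination in events:
        week = ts.strftime("%Y-%W")
        totals[week] += 1
        flagged[week] += int(is_hallucination)
    return {week: flagged[week] / totals[week] for week in totals}

def drifting_weeks(rates: dict[str, float], baseline: float, tolerance: float = 0.02):
    # weeks where the rate moved above baseline by more than the tolerance
    return [week for week, rate in sorted(rates.items()) if rate > baseline + tolerance]
```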
- I think it depends on the use case and how you define hallucinations. We've seen our metrics perform well (= correlate with human feedback) for use cases like summarization, RAG question-answering pipelines, and entity extraction.
At the end of the day, things like "answer relevancy" are pretty dichotomous, in the sense that it will be pretty clear to a human evaluator whether an answer actually answers the question or not.
I wonder if you can elaborate on why you claim there's no way to detect hallucinations with any certainty.
- I tend to find classic NLP metrics more predictable and stable than "LLM as a judge" metrics, so I'd see if you can rely on them more.
We've written a couple of blog posts about some of them: https://www.traceloop.com/blog
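For example, a deterministic metric like ROUGE-L can be computed locally with the rouge-score package (just an illustration, not something specific to those posts):
```python
# Example of a classic, deterministic NLP metric (ROUGE-L) instead of an LLM judge.
# Requires `pip install rouge-score`; the reference/candidate texts are placeholders.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "The invoice was paid on March 3rd by the finance team."
candidate = "Finance paid the invoice on March 3rd."

score = scorer.score(reference, candidate)["rougeL"]
print(f"precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```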
- We trained our own models for some of them, and we combined some well-known NLP metrics (like Gruen [1]) to make this work.
You're right that it's hard to figure out how to "trust" these metrics. But you shouldn't look at them as a way to get an objective number about your app's performance. They're more of a way to detect deltas - regressions or changes in performance. When you start getting more alerts or more negative results, you know something regressed; when you get fewer, you can tell you're improving. And this works for tools like RAGAS as well as our own metrics, in my view.
[1] https://www.traceloop.com/blog/gruens-outstanding-performanc...
- I have it internally, I can share it if you want!
But to the point of comparison between these and tools like Traceloop - it's interesting to see this space and how each platform takes its own path and finds its own use cases.
LangSmith works well within the LangChain ecosystem, together with LangGraph and LangServe. But if you're using LlamaIndex, or even just vanilla OpenAI, you'll spend hours setting up your observability systems.
Braintrust and Humanloop (and to some extent other tools I saw in this area) take the path of "full development platform for LLMs".
We try to look at it the way developers look at tools like Sentry. Keep working in your own IDE with your own tools (wanna manage your prompts in a DB or in git? Wanna use LLMs your own way with no frameworks? No problem). We install in your app with one line, work around your existing code base, and make monitoring, evaluation, and tracing work.
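For reference, that "one line" looks roughly like this with the Python SDK (a sketch based on the public docs; app_name is just whatever you want to call your service):
```python
# pip install traceloop-sdk
from traceloop.sdk import Traceloop

# One-line init; LLM, vector DB, and framework calls in the rest of your code
# get picked up automatically by the bundled OpenTelemetry instrumentations.
Traceloop.init(app_name="my-llm-service")
```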
- Great question, and I see you already got a similar answer, but I'll add some of my thoughts on this. We are actively promoting OpenLLMetry as a vendor-agnostic way of observing LLMs (see some examples [1], [2]). We believe people may start with whatever vendor they work with today and gradually shift to (or additionally use) something like Traceloop because of specific features we have - for example, the ability to take the raw data that we output with OpenLLMetry and add another layer of "smart metrics" (like QA relevancy, faithfulness, etc.) that we calculate in our backend/pipelines; or better tooling around observability of LLM calls, agents, etc.
[1] https://docs.newrelic.com/docs/opentelemetry/get-started/tra...
[2] https://docs.dynatrace.com/docs/observe-and-explore/dynatrac...
- Thanks so much! I always say that I'm a strong believer in open protocols, so I'd love to assist you if you want to use OpenLLMetry as your SDK. We've onboarded other startups/competitors like Helicone and Honeyhive and it's been tremendously successful (hopefully that's what they'll tell you as well).
- Thanks!
We differentiate in 2 ways:
1. We focus on real-time monitoring. This is where we see the biggest pain with our customers, so we spent a lot of time researching and building the right metrics that can run at scale, fast and at low cost (and you can try them all in our platform).
2. OpenTelemetry - we think this is the best way to observe LLM apps. It gives you a better understanding of how other parts of the system interact with your LLM. Say you're calling a vector DB, or making an HTTP call - you get them all on the same trace (sketched below). It's also better for customers - they're not vendor-locked to us and can easily switch to another platform (or even use them in parallel).
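A small illustration of what "all on the same trace" means - the helper functions are stand-ins for a real vector DB client, HTTP call, and LLM call:
```python
# Sketch: an LLM call, a vector DB query, and an HTTP call recorded as sibling
# spans under one request span, so they all show up on the same trace.
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

# Placeholder stand-ins for a vector DB client, an HTTP call, and an LLM call.
def fetch_context(q: str) -> str: return "retrieved context"
def call_downstream_api(q: str) -> str: return "enrichment"
def generate_answer(q: str, ctx: str, extra: str) -> str: return "answer"

def handle_request(question: str) -> str:
    with tracer.start_as_current_span("handle-request"):
        with tracer.start_as_current_span("vector-db.query"):
            context = fetch_context(question)
        with tracer.start_as_current_span("http.enrich"):
            extra = call_downstream_api(question)
        with tracer.start_as_current_span("llm.completion"):
            return generate_answer(question, context, extra)
```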
- OpenLLMetry creator here. We've been building the most popular OpenTelemetry instrumentations for LLM providers, vector DBs, and frameworks - including OpenAI, Anthropic, Pinecone, LangChain, and >10 others - since last August [1]. We're using import hooks (like other otel instrumentations), and offer an SDK with a one-line install. I'm also aware of other OSS initiatives doing similar things, so I wouldn't say no one has ever done what you're doing.
- I wouldn't call it misleading marketing - it is what it is, similar to what you can get today from tools like LangSmith, etc.: observability for the LLM part of your system, but using your existing tools. You can further extend that to monitor specific LLM outputs - but that's just another layer on top.