
nirga · 216 karma · Traceloop W23

  1. Hey! We don’t support an OTel collector directly - you have to connect it to some backend. A minimal one could be Jaeger, for example.
  2. Sorry, I didn’t know that. It’s not my article, so I didn’t want to attribute the title to myself.
  3. For example - the fact that the FE environment variables are hardcoded at build time makes it hard to just deploy a container
  4. I think that's the key benefit of using OpenTelemetry - it's pretty efficient and the performance footprint is negligible.
  5. Thanks for spotting those! We'll fix them ASAP.
  6. I think you can (pretty) easily set this up with an OTel collector and something that replays data from S3 - there's a native collector exporter that writes OTel data to ClickHouse (a sketch of the replay idea follows after this list).
  7. You can do that, and it's a good approach - from our experiments it can catch most errors. You don't even need to use different models - even using the same model (I don't mean asking "are you sure?" - just re-running the same workflow) will give you good results. The only problem is that it's super expensive to run on all your traces, so I wouldn't recommend it as a monitoring tool.
  8. Thanks! It can vary greatly between use cases - but we've seen extremely high detection rates for tagged texts (>95%). In production this gets trickier, since you don't know what you don't know (so it's hard to tell how many "bad examples" we're missing). Our false positive rate (examples tagged as bad that weren't) has been around 2-3% of all examples tagged as bad (positive), and we're always working to decrease it.
  9. You're right - we faced the same issues. So we plan to move prompts and completions to log events that carry a reference to the trace/span, rather than putting them on the span itself.

    The span would then contain only the most important data, like the prompt template, the model that was used, token usage, etc. You can then split the metadata (spans and traces) and the large payloads (prompts + completions) into different data stores (there's a sketch of this after the list).

  10. Thanks! I wasn’t offended or anything, don’t get the wrong impression.

    What strikes me as odd is the claim that an AI checking an AI is a problem. AI can mean a lot of things - an encoder architecture, a neural network, or a simple regression function. And at the end of the day, similar to what you said, there was a human building and fine-tuning that AI.

    Anyway, this feels like more of a philosophical question than an engineering one.

  11. It's the same logic as saying you don't want to use a computer to monitor or test your code, since that would mean a computer monitoring a computer. AI is a broad term - I agree you can use GPT (or any LLM) to grade an LLM accurately, but that's not the only way to monitor.
  12. I replied to you in a different thread - I don't think calling our companies "deceptive" will help you or me get anywhere. While I agree with you that detection will never be hermetic, I don't think that's the goal. By design you'll have hallucinations, and the question should be how you monitor the rate and look for changes and anomalies.
  13. I'm sorry but this is not what we do. We don't use LLMs to grade your LLM calls.
  14. I think LLMs hallucinate by design. I'm not sure we'll ever get to 0% hallucinations, and we should be OK with that (at least for the coming years?). So getting an alert on a single hallucination becomes less interesting. What's more interesting, perhaps, is knowing the rate at which this happens, and keeping track of whether that rate increases or decreases over time or with changes to models.
  15. I think it depends on the use case and how you define hallucinations. We've seen our metrics perform well (= correlate with human feedback) for use cases like summarization, RAG question-answering pipelines, and entity extraction.

    At the end of the day, things like "answer relevancy" are fairly binary, in the sense that for a human evaluator it will be pretty clear whether an answer is answering the question or not.

    I wonder if you can elaborate on why you claim there's no way to detect hallucinations with any certainty.

  16. Ping me over slack (traceloop.com/slack) or email nir at traceloop dot com
  17. roger that! I like them though (am I a normie then?)
  18. I tend to find classic NLP metrics more predictable and stable than "LLM as a judge" metrics, so I'd check whether you can rely on them more (there's a small example after this list).

    We've written a couple of blog posts about some of them: https://www.traceloop.com/blog

  19. We trained our own models for some of them, and combined some well-known NLP metrics (like Gruen [1]) to make this work.

    You're right that it's hard to figure out how to "trust" these metrics. But you shouldn't look at them as a way to get an objective number for your app's performance. They're more of a way to detect deltas - regressions or changes in performance. When you get more alerts or more negative results (or fewer alerts / fewer negative results), you can tell whether you're regressing or improving. And this works for tools like RAGAS as well as our own metrics, in my view (there's a small sketch of this after the list).

    [1] https://www.traceloop.com/blog/gruens-outstanding-performanc...

  20. I know! When we started, every time I googled "traceloop" this was the first result.

    2 reasons why we chose it (in this order):

    1. traceloop.com was available

    2. we work with traces

  21. I have it internally, I can share it if you want!

    But to the point of comparing these with tools like Traceloop - it's interesting to see this space and how each platform takes its own path and finds its own use cases.

    LangSmith works well within the LangChain ecosystem, together with LangGraph and LangServe. But if you're using LlamaIndex, or even just vanilla OpenAI, you'll spend hours setting up your observability systems.

    Braintrust and Humanloop (and to some extent other tools I saw in this area) take the path of a "full development platform for LLMs".

    We try to look at it the way developers look at tools like Sentry. Keep working in your own IDE with your own tools (wanna manage your prompts in a DB or in git? Wanna use LLMs your own way with no frameworks? No problem). We install in your app with one line, work around your existing code base, and make monitoring, evaluation, and tracing work.

  22. Great question - I see you already got a similar answer, but I'll add some of my own thoughts. We are actively promoting OpenLLMetry as a vendor-agnostic way of observing LLMs (see some examples [1], [2]). We believe people may start with whatever vendor they work with today and gradually shift to, or also use, something like Traceloop because of specific features we have - for example, the ability to take the raw data that OpenLLMetry outputs and add another layer of "smart metrics" (like QA relevancy, faithfulness, etc.) that we calculate in our backend pipelines, or better tooling around observability of LLM calls, agents, etc.

    [1] https://docs.newrelic.com/docs/opentelemetry/get-started/tra...

    [2] https://docs.dynatrace.com/docs/observe-and-explore/dynatrac...

  23. Thanks so much! I always say I'm a strong believer in open protocols, so I'd love to assist if you want to use OpenLLMetry as your SDK. We onboarded other startups / competitors like Helicone and Honeyhive, and it's been tremendously successful (hopefully that's what they'll tell you as well).
  24. Thanks!

    We differentiate in 2 ways:

    1. We focus on real-time monitoring. This is where we see the biggest pain with our customers, so we spent a lot of time researching and building the right metrics that can run at scale, fast and at low cost (and you can try them all in our platform).

    2. OpenTelemetry - we think this is the best way to observe LLM apps. It gives you a better understanding of how other parts of the system interact with your LLM. Say you're calling a vector DB, or making an HTTP call - you get them all on the same trace (there's a sketch of this after the list). It's also better for the customers - they're not vendor-locked to us and can easily switch to another platform (or even use several in parallel).

  25. OpenLLMetry creator here. We've been building the most popular OpenTelemetry instrumentation for LLM providers - including OpenAI, Anthropic, Pinecone, Langchain, and >10 others - since last August [1]. We're using import hooks (like other OTel instrumentations) and offer an SDK with a one-line install (see the sketch after this list). I'm also aware of other OSS projects doing similar things, so I wouldn't say no one has ever done what you're doing.

    [1] https://github.com/traceloop/openllmetry

  26. How does it do that for Ruby, for example (which is in the link you provided)? OTel instrumentation for HTTP doesn't capture the request/response body, so you won't be able to see token usage, prompts, and completions. Or am I missing something?
  27. But if you have a high variance when calculating a specific score for the same text output - how can it even be useful? Let's say you get score 20 for text A and then score 40 for text B - you can't infer that text A is necessarily worse than text B.
  28. I wouldn't call it misleading marketing - it is what it is, similar to what you can get today from tools like Langsmith, etc.: observability for the LLM part of your system, but using your existing tools. You can further extend that to monitor specific LLM outputs - but that's just another layer on top.
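
A minimal sketch of the replay idea from comment 6 above, assuming boto3 and clickhouse-connect. The bucket, table, and column names are hypothetical, and in a real setup the OpenTelemetry Collector's ClickHouse exporter would normally handle the conversion; this only illustrates the shape of a replayer that loads archived OTLP/JSON spans into ClickHouse.

```python
# Hypothetical replayer: pull OTLP/JSON span dumps from S3 and load a few fields into ClickHouse.
# Bucket, table, and column names are made up for illustration; a real pipeline would more likely
# feed the data through an OTel Collector with its ClickHouse exporter.
import json

import boto3
import clickhouse_connect

s3 = boto3.client("s3")
ch = clickhouse_connect.get_client(host="localhost")

BUCKET = "my-otel-exports"   # assumption: where the OTLP/JSON files were archived
TABLE = "otel_spans"         # assumption: a pre-created ClickHouse table

def replay_object(key: str) -> None:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    payload = json.loads(body)  # OTLP/JSON layout: resourceSpans -> scopeSpans -> spans
    rows = []
    for resource_spans in payload.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                rows.append((
                    span["traceId"],
                    span["spanId"],
                    span["name"],
                    int(span["startTimeUnixNano"]),
                    int(span["endTimeUnixNano"]),
                ))
    if rows:
        ch.insert(TABLE, rows,
                  column_names=["trace_id", "span_id", "name", "start_ns", "end_ns"])

for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        replay_object(obj["Key"])
```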
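
A rough sketch of the split described in comment 9, using the standard OpenTelemetry Python API plus Python logging: small metadata stays on the span, while the large prompt/completion payloads are emitted as a log record carrying the trace/span IDs so they can be joined later. The attribute keys and the call_model stub are illustrative, not an official convention.

```python
# Sketch: keep small metadata on the span, ship large payloads as a correlated log event.
# Attribute keys and the call_model stub are illustrative.
import logging
from types import SimpleNamespace

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")
payload_logger = logging.getLogger("llm.payloads")  # in practice, routed to a separate store

def call_model(model: str, prompt: str):
    # stand-in for a real LLM client call
    return SimpleNamespace(text=f"[{model}] response to: {prompt}", total_tokens=42)

def chat(prompt_template: str, rendered_prompt: str, model: str) -> str:
    with tracer.start_as_current_span("llm.chat") as span:
        completion = call_model(model, rendered_prompt)

        # Small, queryable metadata stays on the span.
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_template", prompt_template)
        span.set_attribute("llm.usage.total_tokens", completion.total_tokens)

        # Large payloads become a log event tagged with the trace/span IDs for later joining.
        ctx = span.get_span_context()
        payload_logger.info(
            "llm payload",
            extra={
                "trace_id": format(ctx.trace_id, "032x"),
                "span_id": format(ctx.span_id, "016x"),
                "prompt": rendered_prompt,
                "completion": completion.text,
            },
        )
        return completion.text
```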
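
As a small illustration of the "classic NLP metrics" point in comment 18, here's a deterministic, reference-based metric (ROUGE) computed with the rouge-score package - one example of such a metric, not necessarily one Traceloop uses.

```python
# Example of a classic, deterministic NLP metric (ROUGE): the same inputs always
# produce the same score, unlike an "LLM as a judge" call.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The invoice was sent to the customer on March 3rd."
candidate = "The customer received the invoice on March 3rd."

scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```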
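
A toy sketch of the "detect deltas, not absolute numbers" idea from comment 19: compare the recent rate of negative evaluation results against a rolling baseline and flag when it drifts. The window sizes and threshold are arbitrary illustration values.

```python
# Toy delta detector: treat metric results as a stream of pass/fail evaluations and
# alert when the recent failure rate drifts well above the rolling baseline.
from collections import deque

class RegressionDetector:
    def __init__(self, baseline_window: int = 1000, recent_window: int = 100, factor: float = 1.5):
        self.baseline = deque(maxlen=baseline_window)  # long-term failure history
        self.recent = deque(maxlen=recent_window)      # most recent evaluations
        self.factor = factor                           # how much drift counts as a regression

    def record(self, failed: bool) -> bool:
        """Record one evaluation result; return True if a regression is suspected."""
        self.baseline.append(failed)
        self.recent.append(failed)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent data yet
        baseline_rate = sum(self.baseline) / len(self.baseline)
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate > max(baseline_rate * self.factor, 0.01)

detector = RegressionDetector()
# for failed in stream_of_eval_results():       # e.g. "faithfulness check failed?" booleans
#     if detector.record(failed):
#         alert("possible regression in LLM output quality")
```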
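
To illustrate the "everything on one trace" point from comment 24, here's a plain OpenTelemetry sketch in which a vector-DB lookup and an LLM call end up as child spans of the same request trace. The function bodies and attribute names are placeholders; in practice auto-instrumentation would create these spans for you.

```python
# One request trace containing both the vector-DB lookup and the LLM call as child spans.
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

def search_index(question: str) -> list[str]:
    return ["doc-1", "doc-2"]  # stand-in for a real vector-DB query

def generate_answer(question: str, docs: list[str]) -> str:
    return f"answer to {question!r} based on {len(docs)} docs"  # stand-in for a real LLM call

def answer(question: str) -> str:
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("vector_db.query") as span:
            span.set_attribute("db.system", "my-vector-db")  # illustrative attribute
            docs = search_index(question)

        with tracer.start_as_current_span("llm.chat") as span:
            span.set_attribute("llm.model", "gpt-4o")        # illustrative attribute
            return generate_answer(question, docs)

print(answer("What is our refund policy?"))
```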
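
And for the "SDK with a one-line install" in comment 25, the setup looks roughly like this (based on the OpenLLMetry README; the app_name argument and other options may vary by version). Under the hood the SDK registers import hooks that wrap the LLM clients' request methods, the same pattern other OTel instrumentations use.

```python
# One-line OpenLLMetry/Traceloop setup (per the OpenLLMetry README; options may vary by version).
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my-llm-app")  # after this, supported LLM and vector-DB calls are traced
```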
