- Hey! We don't support an otel collector directly - you have to connect it to some backend. A minimal one can be Jaeger, for example.
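A minimal sketch of that setup, assuming a local Jaeger all-in-one with its OTLP receiver enabled (OTLP/HTTP on port 4318, UI on 16686). OpenLLMetry emits standard OTel spans, so any OTLP-speaking backend can be wired up the same way:
```python
# Minimal sketch: point any OpenTelemetry-based SDK (OpenLLMetry included) at a
# local Jaeger instance. The endpoint assumes Jaeger all-in-one with OTLP over
# HTTP enabled on port 4318; traces then show up in the Jaeger UI on port 16686.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("llm-call"):
    pass  # your LLM call goes here
```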
- You can do it, and it's a good approach - from our experiments it can catch most errors. You don't even need to use different models - even using the same model (I don't mean asking "are you sure?" - just re-running the same workflow) will give you nice results. The only problem is that it's super expensive to run on all your traces, so I wouldn't recommend it as a monitoring tool.
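A rough, purely illustrative sketch of that re-check idea - run_workflow and answers_agree are hypothetical placeholders for your own pipeline and comparison logic (exact match, embedding similarity, another model call, etc.):
```python
# Illustrative sketch of re-running the same workflow and flagging disagreement.
# run_workflow and answers_agree are placeholders for your own pipeline and
# comparison logic.
def recheck(question: str, run_workflow, answers_agree, n_runs: int = 2) -> dict:
    answers = [run_workflow(question) for _ in range(n_runs)]
    consistent = all(answers_agree(answers[0], a) for a in answers[1:])
    return {
        "answers": answers,
        "consistent": consistent,  # inconsistent runs are worth a closer look
    }
```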
- Thanks! It can vary greatly between use cases - but we've seen extremely high detection rates for tagged texts (>95%). When switching to production, this gets trickier since you don't know what you don't know (so it's hard to tell how many "bad examples" we're missing). Our false positive rate (number of examples that were tagged as bad but weren't) has been around 2-3% out of the overall examples tagged as bad (positive) and we always work on decreasing this.
- You're right. We faced those same issues. So we plan to send prompts and completions as log events with a reference to the trace/span, instead of putting them on the span itself.
The span then only contains the most important data like the prompt template, the model that was used, token usage, etc. You can then split the metadata (spans and traces) and the large payloads (prompts + completions) into different data stores.
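A rough sketch of that split (nothing here is the actual implementation - the attribute names and the store_payload sink are placeholders):
```python
# Sketch of the split: small attributes stay on the span, big payloads are
# logged separately with the trace/span IDs so they can be joined later.
import json
import logging
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")
logger = logging.getLogger("llm.payloads")

def store_payload(record: dict) -> None:
    logger.info(json.dumps(record))  # stand-in for your payload store

def record_completion(prompt: str, completion: str, model: str, tokens: int) -> None:
    with tracer.start_as_current_span("llm.completion") as span:
        # keep only lightweight metadata on the span
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.usage.total_tokens", tokens)

        ctx = span.get_span_context()
        store_payload({
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
            "prompt": prompt,
            "completion": completion,
        })
```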
- Thanks! I wasn’t offended or anything, don’t get the wrong impression.
What strikes me as odd is the claim that an AI checking an AI is a problem. "AI" can mean a lot of things - an encoder architecture, a neural network, or a simple regression function. And at the end of the day, similar to what you said, there was a human building and fine-tuning that AI.
Anyway, this feels more like a philosophical question than an engineering one.
- It's the same logic as saying you don't want to use a computer to monitor or test your code, since that would mean a computer monitoring a computer. AI is a broad term - I agree you can use GPT (or any LLM) to grade an LLM in an accurate way, but that's not the only way you can monitor.
- I replied to you in a different thread; I don't think calling our companies "deceptive" will help either of us get anywhere. While I agree with you that detection will never be hermetic, I don't think that's the goal. By design you'll have hallucinations, and the question should be how you can monitor the rate and look for changes and anomalies.
- I think LLMs hallucinate by design. I'm not sure we'll ever get to 0% hallucinations, and we should be OK with that (at least for the coming years?). So getting an alert on a hallucination becomes less interesting. What is more interesting, perhaps, is knowing the rate at which this happens, and keeping track of whether that rate increases or decreases over time or with changes to models.
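To make that concrete, a toy sketch of tracking the rate per week and flagging drift against a baseline (the thresholds and event format are made up):
```python
# Toy sketch: compute a hallucination rate per week and flag weeks that drift
# above a chosen baseline. The tolerance and data shape are arbitrary.
from collections import defaultdict
from datetime import datetime

def weekly_rates(events: list[tuple[datetime, bool]]) -> dict[str, float]:
    """events: (timestamp, was_flagged_as_hallucination) pairs."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for ts, is_hallucination in events:
        week = ts.strftime("%Y-%W")
        totals[week] += 1
        flagged[week] += int(is_hallucination)
    return {week: flagged[week] / totals[week] for week in totals}

def drifting_weeks(rates: dict[str, float], baseline: float, tolerance: float = 0.02):
    # weeks where the rate moved above baseline by more than the tolerance
    return [week for week, rate in sorted(rates.items()) if rate > baseline + tolerance]
```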
- I think it depends on the use case and how you define hallucinations. We've seen our metrics perform well (= correlate with human feedback) for use cases like summarization, RAG question-answering pipelines, and entity extraction.
At the end of the day, things like "answer relevancy" are pretty dichotomous, in the sense that it will be pretty clear to a human evaluator whether an answer actually answers the question or not.
I wonder if you can elaborate on why you claim there's no way to detect hallucinations with any certainty.
- I tend to find classic NLP metrics more predictable and stable than "LLM as a judge" metrics, so I'd see if you can rely on them more.
We've written a couple of blog posts about some of them: https://www.traceloop.com/blog
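For example, a deterministic metric like ROUGE-L can be computed locally with the rouge-score package (just an illustration, not something specific to those posts):
```python
# Example of a classic, deterministic NLP metric (ROUGE-L) instead of an LLM judge.
# Requires `pip install rouge-score`; the reference/candidate texts are placeholders.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "The invoice was paid on March 3rd by the finance team."
candidate = "Finance paid the invoice on March 3rd."

score = scorer.score(reference, candidate)["rougeL"]
print(f"precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```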
- We trained our own models for some of them, and we combined some well-known NLP metrics (like Gruen [1]) to make this work.
You're right that it's hard to figure out how to "trust" these metrics. But you shouldn't look at them as a way to get an objective number about your app's performance. They're more of a way to detect deltas - regressions or changes in performance. When you start getting more alerts or more negative results, you know something regressed; when you get fewer, you can tell you're improving. And this works for tools like RAGAS as well as our own metrics, in my view.
[1] https://www.traceloop.com/blog/gruens-outstanding-performanc...
- I have it internally, I can share it if you want!
But to the point of comparison between these and tools like Traceloop - it's interesting to see this space and how each platform takes its own path and finds its own use cases.
LangSmith works well within the LangChain ecosystem, together with LangGraph and LangServe. But if you're using LlamaIndex, or even just vanilla OpenAI, you'll spend hours setting up your observability systems.
Braintrust and Humanloop (and to some extent other tools I saw in this area) take the path of "full development platform for LLMs".
We try to look at it the way developers look at tools like Sentry. Keep working in your own IDE with your own tools (wanna manage your prompts in a DB or in git? Wanna use LLMs your own way with no frameworks? No problem). We install in your app with one line, work around your existing code base, and make monitoring, evaluation, and tracing work.
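For reference, that "one line" looks roughly like this with the Python SDK (a sketch based on the public docs; app_name is just whatever you want to call your service):
```python
# pip install traceloop-sdk
from traceloop.sdk import Traceloop

# One-line init; LLM, vector DB, and framework calls in the rest of your code
# get picked up automatically by the bundled OpenTelemetry instrumentations.
Traceloop.init(app_name="my-llm-service")
```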
- Great question, and I see you already got a similar answer, but I'll add some of my thoughts on this. We are actively promoting OpenLLMetry as a vendor-agnostic way of observing LLMs (see some examples [1], [2]). We believe people may start with whatever vendor they work with today and gradually shift to (or additionally use) something like Traceloop because of specific features we have - for example, the ability to take the raw data that we output with OpenLLMetry and add another layer of "smart metrics" (like QA relevancy, faithfulness, etc.) that we calculate in our backend/pipelines; or better tooling around observability of LLM calls, agents, etc.
[1] https://docs.newrelic.com/docs/opentelemetry/get-started/tra...
[2] https://docs.dynatrace.com/docs/observe-and-explore/dynatrac...
- Thanks so much! I always say that I'm a strong believer in open protocols, so I'd love to assist you if you want to use OpenLLMetry as your SDK. We've onboarded other startups/competitors like Helicone and Honeyhive and it's been tremendously successful (hopefully that's what they'll tell you as well).
- Thanks!
We differentiate in 2 ways:
1. We focus on real-time monitoring. This is where we see the biggest pain with our customers, so we spent a lot of time researching and building the right metrics that can run at scale, fast and at low cost (and you can try them all in our platform).
2. OpenTelemetry - we think this is the best way to observe LLM apps. It gives you a better understanding of how other parts of the system interact with your LLM. Say you're calling a vector DB, or making an HTTP call - you get them all on the same trace (sketched below). It's also better for customers - they're not vendor-locked to us and can easily switch to another platform (or even use them in parallel).
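A small illustration of what "all on the same trace" means - the helper functions are stand-ins for a real vector DB client, HTTP call, and LLM call:
```python
# Sketch: an LLM call, a vector DB query, and an HTTP call recorded as sibling
# spans under one request span, so they all show up on the same trace.
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

# Placeholder stand-ins for a vector DB client, an HTTP call, and an LLM call.
def fetch_context(q: str) -> str: return "retrieved context"
def call_downstream_api(q: str) -> str: return "enrichment"
def generate_answer(q: str, ctx: str, extra: str) -> str: return "answer"

def handle_request(question: str) -> str:
    with tracer.start_as_current_span("handle-request"):
        with tracer.start_as_current_span("vector-db.query"):
            context = fetch_context(question)
        with tracer.start_as_current_span("http.enrich"):
            extra = call_downstream_api(question)
        with tracer.start_as_current_span("llm.completion"):
            return generate_answer(question, context, extra)
```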
- OpenLLMetry creator here. We've been building the most popular OpenTelemetry instrumentations for LLM providers, vector DBs, and frameworks - including OpenAI, Anthropic, Pinecone, LangChain, and >10 others - since last August [1]. We're using import hooks (like other otel instrumentations), and offer an SDK with a one-line install. I'm also aware of other OSS initiatives doing similar things, so I wouldn't say no one has ever done what you're doing.
- I wouldn't call it misleading marketing - it is what it is, similar to what you can get today from tools like LangSmith, etc.: observability for the LLM part of your system, but using your existing tools. You can further extend that to monitor specific LLM outputs - but that's just another layer on top.