Comment by b0a04gl - Hacker Neue

b0a04gl 4 days ago parent

tbh that's not the flex. storing 100PB of logs just means we haven't figured out what's actually worth logging. metrics + structured events can usually tell 90% of the story. the rest? trace level chaos no one reads unless prod's on fire. what'd could've done better be: auto pruning logs that no alert ever looked at. or logs that never hit a search query in 3 months. call it attention weighted retention. until then this is just high end digital landfill with compression

hnlmorg 4 days ago

I’m of the opposite opinion. It’s better to ingest everything and then filter out the stuff you don’t want at the observability platform.

The problem of filtering out debug logs is you don’t need them, until you do. And then trying to recreate an event you can’t even debug is often impossible. So it’s easier to then retrieve those debug logs if they’re already there but hidden.

hinkley 4 days ago

Once a bug is closed the value of those logs starts to decay. And the fact is that we get punished for working on things that aren’t in the sprint, and working on “done done” stories is one of those ways. Even if you want to clean up your mess, there’s incentive not to. And many of us very clearly don’t like to clean up our own messes, so a ready excuse gets them out of the conversation about a task they don’t want to be voluntold to do.

hnlmorg 4 days ago

In DevOps (et al) the value of those logs doesn’t decay in the same way it does in pure dev.

Also, as I pointed out elsewhere, modern observability platforms enable a way to have those debug logs available as an archive that can be optionally ingested after an incident but without filling up your regular quota of indexed logs. Thus giving you the best of both worlds (all logging but without the expense and flooding your daily logs with debug messages)

hinkley 3 days ago

> In DevOps (et al) the value of those logs doesn’t decay in the same way it does in pure dev.

I’ve been on-call, and I think you’re cherry picking. The world has too many devs who still debug with log statements. Those logs never had any value to anyone but the original author.

I’ve also seen too many devs who are perfectly happy trying to write vastly complex Splunk queries to generate charts, and those charts tend to break in a production incident becausea bunch of people load them at once and blow up Splunk’s rate limiting. I’ve almost never had this problem with grafana. It’s true that you can make a dashboard with long-term trends that will fall over, but you wouldn’t use that dashboard for triage, unless you make one that tries to do both and the solution is split it into two dashboards.

If you want to make a successfully scaling organization, you need a way for new members to join your core of troubleshooters, without pulling resources away from solving the trouble. So they can’t demand time, resources or attention that are in short supply from the core group.

Grafana fits that yardstick much better than log analyzers.

hnlmorg 3 days ago

You’re arguing a different argument.

You’re making a case that cryptical logs messages are bad. And I agree.

You’re also making a case that logs are only piece of the telemetry ecosystem. And I agree there too.

What I’m arguing is that there isn’t a need to filter logs based on cost because you can still work with them in observability platforms in a cost effective way.

Lastly, I didn’t say everything should be instantly available. Long term logs shouldn’t be in the same expensive storage pool as recent logs. But there should be a convenient way to import from older log archives into your immediate log querying tools (statement here is intentionally vague because different observability platforms will engineer this differently and call this process by different names)

As for complex queries, regardless of how easy to use your observability platform is, however many saved queries and dashboards you have built, there’s always going to be a need for upskilling your staff. That’s an inescapable problem.

pstuart 4 days ago

My approach for this is to add dev logging IN ALL CAPS so that it stands out as ugly and "need adjusting", which is to delete it before merging to main.

hinkley 3 days ago

On my last project I was able to convince the team to clean up feature toggles before closing out epics. But I didn’t make much headway on logs. I came at them sideways and got all but one of my coworkers to stop trying to generate charts from Splunk and use Grafana instead. And I squeezed him by adding stats for things he liked to look at b

jgalt212 4 days ago

> then filter out the stuff you don’t want

This is often easier said than done. And there's ginormous costs associated with logging everything. Money that can be better spent elsewhere.

Also, logging everything creates yet another security hole to worry about.

phillipcarter 4 days ago

> And there's ginormous costs associated with logging everything

If you use a tool that defaults the log spew to a cheap archive, sampling to the fast store, and a way to pull from the archive on-demand much of that is resolved. FWIW I think most orgs get big scared at seeing $$$ in their cloud bills, but don't properly account for time spent by engineers rummaging around for data they need but don't have.

nijave 4 days ago

>but don't properly account for time spent by engineers rummaging around for data they need but don't have

This is a tricky one that's come up recently. How you you quantify the value of $$$ observability platform? Anecdotally I know robust tracing data can help me find problems in 5-15 minutes that would have taken hours or days with manual probing and scouring logs.

Even then you have the additional challenge of quantifying the impact of the original issue.

phillipcarter 4 days ago

At the end of the day it's just vibes. If the company is one that sees:

- Reliability as a cost center

- Vendor costs are to be limited

- CIO-driven rather than CTO-driven

Then it's going to be a given that they prioritize costs that are easy to see, and will do things like force a dev team to work for a month to shave ~2k/month off of a cloud bill. In my experience, these orgs will also sometimes do a 180 when they learn that their SLAs involve paying out to customers at a premium during incidents, which is always very funny to observe. Then you talk to some devs and they say things like "we literally told them this would happen years ago and it fell on deaf ears" or something.

behemot 2 days ago

> Anecdotally I know robust tracing data can help me find problems in 5-15 minutes that would have taken hours or days with manual probing and scouring logs.

exactly. high-cardinality, wide structured events are the way.

hnlmorg 4 days ago

Not really. Most observability platforms already have tools to support this kind of workflow in a more cost effective way.

> Also, logging everything creates yet another security hole to worry about.

I think the real problem isn’t logging, it’s the fact that your developers are logging sensitive information. If they’re doing that, then it’s a moot point if those logs are also being pushed to a third party observability platform or not because you’re already leaking sensitive information.

jgalt212 4 days ago

Fair enough, but if you don't push them to "log everything" there are less chances for error.

hnlmorg 4 days ago

I disagree.

If developers think “log everything” means “log PII” then that developer is a liability regardless.

Also, this is the sort of thing that should get picked up in non-prod environments before it becomes a problem.

If you get to the point where logging is a risk then you’ve had other failures in processes.

hinkley 4 days ago

Java had particularly bad performance for logging for a good while and I used to make applications noticeably faster by clearing out the logs nobody cared about anymore. Just have to be careful about side effects in the log lines.

UltraSane 4 days ago

You really need to define how much you are willing to spend on logging/observability as a percentage of total budget. IMHO 5% is bare minimum 10% is better. I've worked for a company that had a dedicated storage array just for logging with Splunk and it was amazing and very much worth the money.

Good automatic tiering for logs is very useful as the most recent logs tend to be the most useful. I like NVMe -> hard disk -> tape library. LTO tape storage is cheap enough you don't need to delete data until it is VERY old.

nikolayasdf123 3 days ago

that looks like tail-error sampling

gavinray 4 days ago

"Better to have it and not need it; than to need it, and not have it..."

jodrellblank 4 days ago

“You can’t have everything. Where would you put it?” - Steven Wright.

“Better to have hoarding disorder than to need a fifty year old carrier bag full of rotting bus tickets and not have one” really should need more justification than a quote about how convenient it is to have what you need. The reason caches exist as a thing is so you can have what you probably need handy because you can’t have everything handy and have to choose. The amount of things you might possibly want or need one day - including unforeseen needs - is unbounded, and refusing to make a decision is not good engineering, it’s a cop-out.

Apart from cost, the more time and money you spend indexing, cataloging, searching it. How many companies are going to run an internal Google-2002 sized infrastructure just to search their old hoarded data?

hnlmorg 2 days ago

This is a really easy problem to solve.

Step one: add log severity to your log messages (pretty much every log library supports this out of the box).

Step two: add a log archive (you should have this anyway so that logs can be retained past the initial retention period of your log querying tools. Eg you might have a compliance requirement to keep logs for two years but you obviously wouldn’t want anything that old stored in your expensive fast log search)

Step three: create a way to ingest your archived logs (again, something your business should have, otherwise what’s the bloody point in having an archive)

Step four: have a rule that pushes logs of high severity straight into your log ingestion pipeline, and logs of lower severity into your archive.

Step four seems to be the piece that most people are oblivious too. But it’s generally really easy to implement. Particularly so if you’re using a reputable observability platform.

People who think “log everything” means “log PII” or “stick everything in the same log ingestion pipeline” are simply doing logging wrong. I’m not normally one to say “you’re doing it wrong” but when it comes to logging, these tools are long since mature now. The problem isn’t the tooling, it’s people’s awareness of it.

gavinray 4 days ago

I'm not sure what poor engineering practices you have seen, but in my painfully-gotten experience, application of this principle usually amounts to having varying levels of a debug log flag that dump this info either to stdout via JSONL that's piped somewhere, or as attributes in OTEL spans.

This has never been a source of significant issues for me.

lelanthran 3 days ago

> "Better to have it and not need it; than to need it, and not have it..."

Having it is pointless if your SNR is so low that it costs more money than simply waiting for the bug the next time it comes up.

IMO, if a bug never surfaces again, that's not a bug I care about anyway. Keeping all generated data in case someone wants to see the record from a bug 3 months ago is absolutely pointless - if it hasn't surfaced again in the last three weeks, you absolutely have more high-priority things to look at!

I want to see this mythical company, where a paid employee is dedicated by the company to look at a log from 3 months ago, to solve a bug that hasn't resurfaced in that three month period!

9dev 4 days ago

Until you’re working with personal information of EU customers, where the opposite maxime applies: "Only store what you absolutely need"

Seriously, storing petabytes of logs is a guarantee for someone on your team writing sensitive data to logs, and/or violate regulations.

jkogara 4 days ago

Or more succinctly, albeit less eloquently: "Better to be looking at it than looking for it."

Macha 4 days ago

I've been in a bunch of companies that have pushed for reducing logs in favour of metrics and a limited set of events, usually motivated by "we're using datadog and it's contract renewal time and the number is staggering".

The problem is, if you knew what was going to go wrong, you'd have fixed it already. So when there's a report that something did not operate correctly and you want to find out WTF happened, the detailed logs are useful, but you don't know which logs are useful for that unless you have reoccuring problems.

Spivak 4 days ago

> trace level chaos no one reads unless prod's on fire

God why do we keep these fire extinguishers around, they sit unused 99.999% of the time.

jiggawatts 4 days ago

“Just go back in time and turn on the specific log you will need!”

ethan_smith 4 days ago

The "attention weighted retention" concept is brilliant. You could implement this with a simple counter tracking query/alert hits per log pattern, then use that for TTL policies in most observability platforms. This approach reduced our storage costs by 70% while preserving all actionable data.

hinkley 4 days ago

That logging isn’t even free on the sending side, especially in languages where they are eager to get the logs to disk in case the final message reveals why the program crashed.

And there’s a lot of scanning blindness out there. Too much extraneous data can hide correlations between other logs entries. And there’s half life in value of logs written for bugs that are already closed, and it’s fairly short.

I prefer stats because of the way they get aggregated. Though for GIL languages some models like OTEL have higher overhead than they should.

eddd-ddde 3 days ago

Is there any tools that does log/trace capture on error conditions? I.e. we capture all local events, but only upload them when something meaningful happens, like the server crashed / requests are returning 5xx.

mdaniel 3 days ago

I love this idea in principle, but in practice I would guess it means one of two sub-optimal things: either the node caches them for a window of time, in order to know whether to really transmit them, or the logs are mutated post-delivery as kind of a "tiny expiry"

Everything else I could write is just turning various trade-off knobs, which is why I'd guess you haven't seen an out-of-the-box offering that does what you're describing. There's not just one solution to it that would be reasonable for all audiences

nijave 4 days ago

In fairness, I think a lot of GIL languages already have high overload and I've never been under the impression OTEL was optimized for performance and efficiency.

hinkley 3 days ago

It really isn’t. The code reads like it was designed by SpringBoot users. You have to read three different docs to suss out how to use multiple calls together to get a desired approach, and some of the docs leave out critical details. I think people forget that folks use Google thinks is the top result isn’t necessarily what the creators would assume is the document people would find for a topic. I’ve been trying to explain this to the Elixir community for instance.

“Can’t to X, doesn’t work.”

“Look, it’s easy. Did you even RTFM? http://blah.example.com/doc/articleb#section2”

“Uh, no, because search engine took me to http://blah.example.com/doc/articleg#section7”

imiric 4 days ago

Sure, but if the data is already there, it's a sifting and pruning problem, which can be done after ingestion, if needed.

It's better to have all data and not need it, than to need it and not have it. Assuming you have the resources to ingest it in the first place, which seems like the focus of the optimization work they did.

__MatrixMan__ 4 days ago

> auto pruning logs that no alert ever looked at

I'm sure someone somewhere is working on an AI that predicts whether a given log is likely to get looked at based on previous logs that did get looked at. You could store everything for 24h, slightly less for 7d, pruning more aggressively as the data gets stale so that 1y out the story is pretty thin--just the catastrophes.

CoolCold 2 days ago

wut?

> As you’ll read below, this saves us millions of dollars a year and allows us to scale out our ClickHouse Cloud service without having to be concerned about observability costs, or make compromises on the log data we retain.

https://clickhouse.com/blog/building-a-logging-platform-with...

behemot 2 days ago

hey there! I work at ClickHouse. to clarify: the vast majority of this 100PB is structured events. in our case logs are supplementary.

solatic 3 days ago

If you work for a large enterprise, there are so many dev teams supporting so many products that "we haven't figured out what's actually worth logging" is just disconnected from the developer incentives in those teams (ship features fast, fix your problems even faster because nobody has time for that BS) as well as ops incentives (the servers ARE on fire, and the devs didn't log enough). FinOps comes last, if there's even cost tracking per team in the observability suite.

You don't understand why DataDog has a $44 billion market cap. It's yet another instance of Finance complaining that the transition to The Cloud gave every engineer a corporate credit card with no spend controls or a way for Finance to turn off the spigot.

namanyayg 4 days ago (dead)

nikolayasdf123 4 days ago

yeah, same thoughts.

business events + error/tail-sampled traces + metrics

... and logs in rare cases when none of the above works. logs are dump of everyting. why would you want to have so many logs in first place? and then build whole infra to scale that? and who and how reads all those logs? they build metrics on top of that? so might as well just build metrics directly and purposefully? with such high volume, even LLMs would not read them (too slow and too costly).. and what would even LLM tell from those logs? (may be sparce/low signal, hard to decipher without tool-calling, like creating merics)

This item has no comments currently.