> The cost for something that can be replicated free and open source is absurd.

open source it may be. free it is not. paying an expert to correctly deploy an open source solution takes time and money.

oh you want it maintained?

the three recommendations sound like those of a consultant. they work great with exec buy-in, and are a joke without.

yes, yes, you just have to explain yourself. just help management understand. justification is part of the role. how long does that take, and how much effort? (the answer is subjective and contextual.)

to be blunt, this kind of advice is barely more than “do better.” it ignores the situational example of using snowflake cheaply. it acts like most devs aren’t just going to go fire up a postgres rds - you said open source! it oversimplifies all problems, implicitly, to you.

> Pay to make a problem disappear.

no, pay to change the parameters of the problem. this is a fundamental misunderstanding of how to get things done in a constrained environment. it isn’t either/or, and every saas comes with problems. you pay to trade problems. otherwise it wouldn’t be worth a blog post name dropping and shitting on databricks and snowflake costs. in the hands of an untrained user or in a sufficiently constrained environment, they cost a lot - that’s one of the problems you buy. cost management.

> a talk with your local Dev Ops Engineer/ manager to discuss how to secure your implementations

again, hope you’ve taken the time to build these bridges. the author does indicate this is most important.

and it has a cost. someone is building that bridge, and likely someone from each side.

you sometimes pay vendors to deal with technical problems so you can deal with non-technical ones.

most importantly, don’t assume a random stranger on the internet has sufficient context to give you worthwhile recommendations.

> paying an expert to correctly deploy

That's what the Databricks salespeople say. My consistent experience has been that the experts "learned" it over the weekend by reading the brochure, and now not only do I need to get a deep understanding of it myself, I also have to waste my time explaining it to them because they have to actually do it in the closed environment that Databricks is overcharging us for.

Sounds like you're now a Databricks expert. Time to go get that $300/hour consulting fee.

was referring to the open source alternatives.

have yet to meet a vendor-recommended solutions provider that felt like more than a cash grab for anyone too dazzled by the initial onboarding.

the math of outsourcing implementation to consulting firms always seems off - send the problem elsewhere, usually for a premium, and hope the internal folks can digest whatever the actual consulting dev writes after the sales dev talks big game on hypotheticals.

hesitant to say my sample size is too small. feels like a trap.

I don't need a degree in Individual Contributorism with a citation in Consultantology to know that Snowflake and Databricks are ripping people off.

I'm not sure why anyone would be especially passionate about figuring out the details of how, though. If you want to level a criticism at the author, it's spending time on blog posts instead of ostensibly making the super cheap, better data analytics platform he's advocating for.

I think that's fair. I also think it's fair to count on-prem versus hosted as a metric for your org that should be evaluated.

I think that what is missing from the article is that it's not a matter of free versus paid, but rather integrated solution versus dispersed solution with custom glue code. Depending on your organization and your constraints, you should define your budget, wants and needs, build metrics to represent those, evaluate all systems against your particular metrics, and come to a dispassionate result.

[Disclaimer: I work at ClickHouse] I see a lot of the responses here focus on the age-old ‘build v. buy’ debate.

It’s also worth considering the comparative cost of Snowflake against other saas warehouses or databases, depending on your needs. For instance, we’ve heard from users that ClickHouse Cloud can be much more cost-effective for many use cases when compared to Snowflake - real-time analytics is a great example. For those who can’t build (or run OSS themselves), this is another interesting (and important) dimension. What's great about ClickHouse is that it's open source, too :).

Vendor says their solution is cheaper than competitor solution. I'm shocked.
Big corps are risk management companies. The happy path is that you will have a working solution at a fraction of the cost. Mid to worst case scenario is that you end up over your head, the costs are higher than you anticipated and the end result is mediocre at best. Big corps are willing to pay a high premium to guarantee a certain level of performance with penalties if necessary, especially if the solution is pretty far removed from their core business.
So how come these risk managers jump on new hyped tech, built by unprofitable companies, with vendor lock-in and no way to escape?
>open source it may be. free it is not. paying an expert to correctly deploy an open source solution takes time and money. oh you want it maintained?

Funny. I heard this same kind of argument used against replacing oracle with postgres. It reminds me of Microsoft's "TCO" PR offensive back when they were public about how much they loathed open source competition.

Thing is, it wasn't just a straw man (nobody that needs to be convinced is under any illusions that software has to be maintained); Oracle was also way more expensive to maintain on top of being expensive to run.

They had their hooks into that organization pretty good though and once that happens technology choices become highly political - which experts are you going to fire and which ones are you going to hire?

The Oracle lot obviously didn't want it to be them that got managed out.

context matters.

you’re calling it an organization; that alone typically indicates a larger scale, at which supporting something internal might be feasible. or at least it’s a term used by those who have been there. you’re also referencing politics, which, again, is highly suggestive of a specific type of experience: experience within scale, if not at scale.

please don’t mistake my intent to be that open source never makes sense. with the right plan and personnel, it can work better.

with the right scale and support, open source projects are started, though they aren’t always open from day one.

I think you are vastly underestimating the amount of cost/manpower needed for Snowflake and cloud-based solutions (they are anything but turn-key and have a lot of churn). There are cases for both, but it's not as simple as open source = hire and cloud = no people needed.

With Snowflake you are paying for not only the development of the product, the hosting of it, the engineers to run it, but also the sales, marketing, and management behind it. But on top of all that you probably need to hire people to implement and maintain a solution utilizing it for your organization.

always fun when folks make assumptions.

> With Snowflake you are paying for not only the development of the product, the hosting of it, the engineers to run it, but also the sales, marketing, and management behind it.

and you can run the numbers to determine if the cost of ownership offsets headcount. what your dollar goes to isn’t part of that formula. only what your dollar gets you.

> But on top of all that you probably need to hire people to implement and maintain a solution utilizing it for your organization.

that is also a given. you would need at least as many people to implement analytics on top of the homebrew data warehouse, in addition to headcount to run the warehouse itself.

you seem to want to make this out to be vastly different spend amounts. it can be. sometimes it is, sometimes it is not, and the direction can vary.

> ...a talk with your local Dev Ops Engineer/ manager to discuss how to secure your implementations

This right here is exactly why Snowflake is a good fit for my org.: we could pay someone's salary to install, maintain, and upgrade some open source alternative (and the VPS to run it), or we could just pay for Snowflake and stop wasting engineering's time on the Data Science's team's stuff, which frees them up to move faster on the core platform features bringing in the big bucks.

This reads like a flavor of the same argument I've heard many times over. You know what the biggest cost of an org. really is as we head into a recession? Labor. Stop wasting it on stuff done better by specialized companies that free you up to get stuff done.

>This right here is exactly why Snowflake is a good fit for my org.: we could pay someone's salary to install, maintain, and upgrade some open source alternative (and the VPS to run it), or we could just pay for Snowflake and stop wasting engineering's time on the Data Science's team's stuff, which frees them up to move faster on the core platform features bringing in the big bucks.

The cost problem really is out of control. You'd come out ahead financially pretty quickly with a DIY solution. DBUs are on the order of ~100x the actual underlying compute cost. I simply cannot understand how anyone chooses this over running your own Spark clusters with Jupyterlab. There are single-click image deploys that do this stuff.

Here was my situation. Occasional queries. Over a couple petabytes of data. Customer-facing, so responses in seconds per the SLA, but >95 percent of the time the warehouse isn't running. Cached queries from within 24 hours, which don't require the warehouse to even spin up. Our Snowflake costs were significantly less than an FTE.

Would that potentially be a situation which “running your own” doesn’t make sense?

>Would that potentially be a situation which “running your own” doesn’t make sense?

Look into datalake architectures. RDBMS based data warehousing is obviously not economical at the petabyte scale. But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.

> Look into datalake architectures.

Yup .. comfy with iceberg/delta/hudi

> RDBMS based data warehousing is obviously not economical at the petabyte scale.

I never said it was .. I'm simply responding to "I simply cannot understand how anyone chooses this over running your own Spark clusters with Jupyterlab". I'm trying to help you understand why folks would choose a SaaS over run your own.

> But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.

No. You don't just pay for object storage + minor S3 read costs.

You pay for operations. You pay for someone setting up conventions. You pay to not have to optimize data layouts for streaming writes. You pay to not have to discover race conditions in S3 when running multiple Spark clusters writing to the same Delta tables. You pay to not have to discover that your partitioning/clustering needs have changed based on new data or query patterns.

But look .. I get it. You have chosen to optimize for cost structures in one way .. and I've chosen to optimize in a different way. In the past I've done exactly as you've said as well. I think seeking to see _why_ folks may have chosen a different path may help you understand other areas to consider in operations.

If you have petabytes of data, I don't think this article is talking about your use case.
I think it is?

Or I guess, what data size do you think it's talking about? If you only have gigabytes of data, none of this matters, you can use anything pretty cheaply and easily. So is this article just for "terabytes" or does it go up to "hundreds of terabytes" but not "petabytes"?

Where do you see DBUs being many-times multiples of the underlying compute? AFAICT they're basically a x% (for x << 100) overhead on what you'd pay AWS / Azure for the compute.

Of course, there's the argument that AWS et al are overpriced relative to running your own hardware... which this community can argue ad infinitum, but seems besides the point here.

DBUs are about +50% the underlying compute cost.

We use Databricks in production extensively.

> DBUs are on the order of ~100x the actual underlying compute cost

Where did you get that from? Sounds like it can only be true if you only take into account electricity costs, not capital or wages.

> we could pay someone's salary to install, maintain, and upgrade some open source alternative (and the VPS to run it), or we could just pay for Snowflake and stop wasting engineering's time

You are missing the "we could pay someone that provides a commodified, open-source alternative to Snowflake" option.

It would be cheaper, it wouldn't require any of your engineering team resources and it would help us all get rid of single vendor lock-in.

> You are missing the "we could pay someone that provides a commodified, open-source alternative to Snowflake" option.

Which services provider do you have in mind to pay? The advantage of Snowflake is that the CIO can pick Snowflake as the default for the entire org and it will work across a very large number of business use cases. I don't see any open source analytic software that can do that.

In my experience open source analytic solutions work much better for specific use cases, say network flow log analysis. You pick exactly the right set of components for that use case and build specifically to the problem. Maybe you use ClickHouse + Kafka--cheaper, simpler, faster. But when you shift to another problem, such as analyzing support case patterns, that requires unpredictable joins on large numbers of tables the first solution does not work at all. Now you need something like Presto on data lakes, which is completely different technology.

The alternative is to move all the data in to Snowflake and do both of the above use cases there. Sure, it's not the fastest or the cheapest software but the savings in labor more than make up for the licensing cost. A lot of companies make that decision.

> The advantage of Snowflake is that the CIO can pick Snowflake as the default for the entire org

We are talking about companies that do not even have a CIO, which is just another way to say "most companies".

>it wouldn't require any of your engineering team resources

other than the resources to evaluate the commodified open-source alternative. and the resources to manage the risk of that alternative, and the smaller company who runs it, going away.

"nobody ever got fired for buying IBM" is usually said in a snarky way, but there's real value in not only not having to think about the provider, but not having to think about which provider to choose. making that decision has a cost, and the more off the beaten path you go the higher the cost of that decision.

> "nobody ever got fired for buying IBM"

https://en.wikipedia.org/wiki/File:Survivorship-bias.svg

How many startups simply died or how many projects never got to be launched because they thought they needed some enterprise support from the start?

Would it be cheaper? Cloudera fits the bill and it’s for sure not cheaper.

Because Cloudera is aimed at a different market than the "most companies" being talked about in TFA?

Plenty of companies would be fine using a managed PostgreSQL database for their "data warehouse", and you can get, e.g., one reasonably powerful server at Digital Ocean for $60/month with managed backups - and that already has quite a big markup compared to their droplets, because they are just packaging some monitoring tools on top of their existing systems.

I think most companies don't understand a large part of the Databricks offering and it should be used by way more organizations. Disclaimer: I was a Databricks user for 6 years and now work at Databricks.

Yeah, you can create your own Spark deployment, but it will run much slower than the Databricks Runtime (DBR) or the Databricks proprietary Spark Runtime (Photon). Computations that run slower cause you to have a larger cloud compute bill. Databricks rewrote Spark in C++ and it runs really fast and saves a lot on EC2 compute.

> Define when you should compact files, when to Z-order

Or don't consider these issues and use autocompaction / the new Liquid clustering. These are great examples of problems the platform should solve, so the user has time to focus on business logic.

> If you can sniff out the inefficiencies in your Data early and make architecture that handles your specific data

I don't know what this means.

Are you going to build a deep learning model to make read/writes faster like Databricks predictive I/O? https://docs.databricks.com/en/optimizations/predictive-io.h.... Probably not, you have a lot of business problems to solve.

> Do the real work. Work with people. The Code will write itself.

I've seen lots of DIY data platforms. They're horrible to work with and I can assure you that the code does not write itself. The data engineers have a lot less time to write code because they're constantly trying to stand the platform back up.

> Are you going to build a deep learning model to make read/writes faster like Databricks predictive I/O?

It makes sense for Databricks to do this because they're building a product that needs to work for a high cardinality of datasets. Using a DNN for that is defensible because the input shapes are practically infinite. But for individual orgs, it seems much more likely that simple heuristics-driven access pattern optimizations can be done without throwing ML at the problem (though I'll say the predictive IO concept is a cool one, I've done similar ML work for network traffic QoS).

> I've seen lots of DIY data platforms. They're horrible to work with and I can assure you that the code does not write itself.

This one made me scratch my head a bit, because I have seen many DIY platforms as well over a 20 year career. Most of them have been amazing to work with. The code didn't write itself of course, but maintenance burdens were low and specialization/expertise within the org was high as a result. On top of that, I didn't need to argue with an AE about rising costs on a per-annum basis whenever renewal time comes around.

I point this out not in the service of coloring your observations as wrong or misguided, but to highlight that there's going to be a spectrum of varied lived experiences people have with the build-vs-buy conundrum. I expect we'll never see genuine consensus on this issue as a result of that variance (and maybe we don't have to).

I begrudgingly agree. We tried to stand up a Spark cluster and there are just loads of configuration details and little niggles that get in the way, like making sure everything on your cluster has the libraries it needs.

Putting it on Databricks, most of those problems went away and we could just get on with writing the code.

Could we have figured out how to get Spark working in a bespoke configuration? Almost certainly. By our mid-September deadline? Probably not.

True from a purely technical point-of-view, but doesn't take into account how companies make decisions about adopting platform tech and that they're making these choices for good reasons.

When companies choose SF or DB they don't make a decision to install one of these platforms because there is no combination of open-source components and bespoke engineering work that could replace them and even be more flexible and efficient. They are choosing these platforms because they make it easy to manage data operations holistically, because it's easier to hire people who already know them well, because they are opinionated and restrict the range of bad or just strange decisions someone in the organisation could be making if they were not restricted, and of course for enterprise support, available from a single reputable vendor.

See also: 80% of companies don't need Kubernetes, 80% of companies don't need a big-3 cloud, 80% of companies don't need managed security solutions, 80% of companies don't need SAP, 80% of companies don't need Salesforce, etc etc etc...

Many companies also install DB, k8s, etc. because costs don't matter as much. In the times of easy money they focus on running fast, borrowing if needed.

When money is harder to borrow, as it is now, some of those decisions get reassessed. My 2c.

cost of migration is part of any assessment worth anything. rarely cheap, rarely on time.

not that some companies won’t bleed themselves out in a panicked effort to try and stop a paper cut.

Most companies don't need a lot of things.

I don't do a whole lot of data analytics anymore, but I'd say: Start with figuring out how much data you actually have. We still see companies claim to have vast amounts of data, but in reality they are talking about less than a TB of data, frequently just a few 100GB. When you operate at that scale, just chuck your data into whatever database you're comfortable with and do SQL queries, it's fine.

Once you hit the scale where you have "enough" data I'd agree with many of the other comments: Managing open source or home grown solutions quickly become more expensive than just paying for a service. Not quite the same, but we considered deploying a few solutions on OpenStack, but once you paid for training and staff it turned out to be cheaper to just give VMWare more money.

> We still see companies claim to have vast amounts of data, but in reality they are talking about less than a TB of data, frequently just a few 100GB. When you operate at that scale, just chuck your data into whatever database you're comfortable with and do SQL queries, it's fine.

I did some work with a company that talked about their “big data”, bragged about their data lake, and made grandiose statements about how they had big plans to use “all this data” to get an edge up on their competitors. They talked about the big data platforms and technologies they were evaluating and how they’d probably have to hire more people to handle it all.

When I pressed them for details on it, they eventually explained the data they had on each customer and the approximate number of customers. I did some quick math and calculated the total amount of data on all of their customers was under 100MB.

They had managed to blow it up to occupy a couple orders of magnitude more space in one situation (think storing time series data as JSON where every point had duplicated tags attached due to ineffective structure) but even that was small enough to fit entirely in RAM of a decent sized server.

Remember some industries are perversely incentivized to drive up costs because profits are limited as a percentage of total cost.
Could you please provide examples of those kinds of industries?
Health insurance is the example that comes to mind.
You can buy a single server with TiBs of RAM. That’s why I think motherduck/duckdb really shines. It’s the best solution for 99% of users whose data can be processed relatively quickly on a single machine.
I find it frustrating to search for stuff for mid-scale problems sometimes. There's a lack of context about what 'metrics' or 'analytics' means, and that leads to over-engineered solutions.

I run 2-3 (primarily database & web) servers. I thought it would be a nice idea to aggregate web accesses and stuff into a central location for analysis. It's probably 50-100MB of logs a day at most. But all the 'industry standard' solutions are for something at much bigger scale. Start with blah, spin that up on a k8s cluster, run this in docker, blah blah. FFS, I don't need that, I just want some nice graphs.

So I just shove it all into a postgres database (since we are already familiar with it) and visualize it with Grafana. I'm sure people would complain that it's not the proper way, but I prefer spending my time solving my users' problems, not self-inflicted ones.

I feel this in my bones. I really find it hard to figure out the right solutions for scaling things that will be "bigger than small" but also "smaller than big" for the foreseeable future. That is, when the "starting point" conventional wisdom ("just use postgres / sqlite for everything; anything else is yagni") isn't quite up to snuff, but the "scale up" solutions are designed to handle far bigger use cases and bring some baggage along with that.

I have honestly found Snowflake pretty useful for this, because it seems pretty cheap to me for a small amount of usage, but is also elastic up to this middle-ground of usage. But I suspect that at "truly big" I would / will be interested in different solutions.

For the graphs I don't know, but I was happy to learn that Journald can actually ship logs to other Journald installations. Haven't used it yet, but I'm still looking for a place where it would make sense.
Fairly confident anyone outside of sales is going to tell you the solution that works is the best solution.
A terabyte of data can still be vast, depending on how varied and unstructured it is. If you have thousands of different logfile formats, data dump formats and various complex file formats, then you need a very different solution than you would to handle a terabyte of uniform data in a CSV file or similar.
Spark won't inherently help you solve that problem though, and neither will most of the big data solutions.
Of course they do not need it. 80% of companies can also do without AWS, they "just" need to hire the people themselves and develop a core competency in running and administering those services. It doesn't automatically mean that it makes financial sense for said company.

I guess "don't need" == "must not use" when you're selling yourself as filling that self-hosted gap.

This is one of the things I hate of this industry: service providers oversell their proposition and overcharge to infinity, while independent professionals often undersell that same proposition so that they can capture some of that overcharging for themselves. If you are an honest actor, you're left pulling your hair out at the overall mendacity of the whole market.

Profit can be such a cancer.

> charging their customers for Ferraris

I've seen this sort of analogy before and I don't think it's the right one here: if I get a Ferrari, I know _exactly_ what I'm getting. I'm getting a really fast, really beautiful car.

I'm not too familiar with Snowflake, but I've suffered under Databricks on and off for about a decade now, and as far as I can tell, it's just a more expensive, closed way to do what I could do a lot faster and a lot easier if I didn't have to work around the obstacles that Databricks puts in my way that don't have any value other than being something that Databricks can charge my employer a ton of money for putting in my way.

The car example in the article is exactly backwards. Snowflake is the Toyota Camry. It's a mass market solution that fits a lot of basic use cases. Readers are conflating the licensing cost of Snowflake with total cost of ownership, which is quite different. The equation tilts further toward Snowflake as you load more data into it. It has a pretty efficient sharing model using virtual data warehouses.

If you need the Ferrari, that's when you build with open source. If I want real-time response at fixed cost on large amounts of data I'm going to take ClickHouse with a customized stack every time. But not every problem needs that solution.

(I work at a ClickHouse vendor and know its habits. But you could say the same about Druid or other databases.)

Theoretically, your employer is paying Databricks, because they'd rather pay them than have a harder time finding employees with actual skills (and paying them accordingly).

Additionally, it's somewhat of a standard now - so it's easy to hire other people in the org with skills using it.

But isn’t that something they would need to do anyway - find employees with actual skills?

I mean sure, it’s how the snake oil is sold - “buy so and so saas product and you wouldn’t need to hire and can reduce cost”. I’m surprised that line can still sell today because in software it’s rarely true unless you can merge excellent talent with said product.

If you work at a small startup - they are mostly parroting what large and mid-size companies are doing - regardless of the fact that they're likely to fail before they get to that size, and doing said thing might make their chances of failure higher.

If you work at a large company - having a larger pool of candidates to hire from is definitely worthwhile.

And, it is much easier to have a poor understanding of Databricks than a very good understanding of data engineering in general.

Someone with a poor understanding of Databricks might be able to do 80% of what you can do with a great understanding of data engineering in general.

We outsource janitors, lunches, and other aspects of the business. Some things are not your core competency where having additional people investment detracts from your core mission.

That, and do you really think 99% of companies are able to build something better or cheaper than Databricks? Likely not.

I work as a solution architect at a consulting firm that builds analytical data platforms for customers. Our company has a partnership with Snowflake, which means all the solutions we build are pushed to use Snowflake. Their sales strategy is very Oracle-like and at least in my circles many Snowflake sales employees are ex-Oracle. This means our sales and Snowflake sales are the best of friends. Formally they'll deny kickbacks, but who knows?

For all my clients Snowflake is overkill when you look from the perspective of growth and scale. They'll never use that part of Snowflake. They might just do as well with DuckDB, Azure Synapse or any other analytical-oriented platform laying around.

What I do like with less-than-big use-cases is that (at least at Snowflake) you pay relatively little if you do relatively little data processing. It's not free, but it doesn't break the bank either.

> and at least in my circles many Snowflake sales employees are ex-Oracle.

given my previous experiences with Snowflake sales, this would not surprise me in the slightest. A small list of the incredibly pushy things SF sales teams did:

- called up junior team members and pushed them to sign a renewal deal
- point blank refused to explain how credits translated to real-world money when pressed about how much usage we were actually getting for our money
- after being told not to, they continued to call team members and pressured them to commit to a renewal
- TAC/support gave consistently vague and unhelpful explanations.

So yeah, Snowflake is an awful company and an overpriced product, and I hope it fails.

sorry, but, there’s something about this that doesn’t smell right.

a junior ic being asked to sign a contract makes zero sense. that’s insanity. plenty of managers at plenty of companies lack that level of authority.

sales teams use slimy tactics. that’s a given. time limits, multi-years, etc, etc, etc. trying to get the wrong person to sign is a different category entirely.

maybe you ran across the “best of the best” sales people there, but, this sounds sufficiently outlandish that i have to question it a bit.

"Snowflake Inc. was founded in July 2012...by three data warehousing experts: Benoît Dageville, Thierry Cruanes and Marcin Żukowski. Dageville and Cruanes previously worked as data architects at Oracle Corporation... "

From: https://en.wikipedia.org/wiki/Snowflake_Inc.#History

The heart of Snowflake is a very solid data warehouse, and one with very few configuration options.

Yes there other bells and whistles, but it doesn’t feel as though you are buying something hugely overblown even if you are doing something very vanilla.

And as you say, it’s all consumption based pricing so I think it’s unfair to characterise their sales team as pushing the wrong solution into the market.

> you pay relatively little if you do relatively little data processing

Yep, surprised how few people have been mentioning this here.

Compared with running a managed postgres instance 100% of the time, running a snowflake xs warehouse for a few minutes a day can be significantly cheaper.

The reason is that Snowflake charges close to cost for object storage and they compress your data when they put it in.

However...that means they need to make it up elsewhere. When you do get around to running queries the markup on compute can be 5-10x (or potentially more) depending on the plan you are using. If you do constant, compute heavy aggregation Snowflake is not the right place to do it.

Yes, but my point is, lots of small businesses come out ahead on this.

Which is totally by design! Their whole model is that once it is no longer cost effective, it isn't worth the switching costs to leave. (Exactly like the AWS model, un-coincidentally.)

But I think lots of commenters here (at least at the time I wrote my original comment) seem to be missing that this isn't nefarious or anything, it's win-win for a lot of businesses.

If you want something fresh and new in this space, check out FeatureBase: https://featurebase.com/. We just added vector/embeddings support to the cloud product.
Managing data yourself is hard. It is hard not because it can't scale, but for more boring reasons like harder DB upgrades and migrations, snapshotting, access control, lack of libraries/SDKs, lack of documentation, harder training of new employees, etc. The number one reason for extended downtime I have seen in companies is that data is in some bad state. And good data engineers who could do all the things the author expects them to do are expensive and not easy to find.
Yeah. This post reads like the Dropbox comment.

I've worked with ad-hoc data solutions, and now I work with analysts who swear by Snowflake. It's the organization, the administration, the maintenance that Snowflake is solving for us, and it's turn-key. It's not easy or cheap to get that done with a local solution.

Would be really wild if 20%, a truly whopping number, of the companies did need Snowflake or Databricks.
I think the author means 80% of existing databricks/snowflake customers, since they say “the rest are overpaying”.
> The Fortune 100 have a use case for these companies, the rest are overpaying.

There is a difference between "a use case for $data_platform" and "a data use case for $data_platform". The scope of the first is the platform in $data_platform; the scope of the second is the specific data requirements in $data_platform.

Working at a non-Fortune-100 insurance company in Europe, almost all our use cases can be easily done on a traditional RDBMS like SQL Server, or on BI tools like SAS. Thankfully, with higher granularity over time, Excel usage is steadily fading out. No big data, no heavy computing necessary - at least from my point of view.

All setups in place today can be called self-service platforms. By cautious estimates we have at least 100 such "platforms" that have been running for years, or even decades.

This situation directly implies that we have a use case for a $data_platform itself. Costs are the biggest driver here, mainly the hidden costs of keeping these 100 systems up and running. Governance and management of the data (locked in all these stores, processed by slow and ugly SQL nobody understands anymore, with an unknown state of data quality) is the key concern today.

My experience is that there is a lot of mileage for small and medium enterprises in using normal RDBMS replication to create a copy of the OLTP DB that drives your business, and running your analytics on that. And putting up with complaints that inefficient analytics queries can be slow.

Really, after spending silly amounts of time and hassle (and money) on fancy Snowflake or whatever, you discover it's not massively faster for those small and medium businesses. And now you have to keep paying for it and keep maintaining it, which is often actually a bigger burden than keeping a normal RDBMS replica alive.

I think "put up with complaining it can be slow to do inefficient analytics queries" is why I would rather pay for some 3rd party tool that would get the complaints.

Business people want some queries that are slow no matter what you do. Then they expect these to run in under a second. They don't care about abstract things like operations complexity or big O - they want it here and now.

Even if your data fits on one server there are still queries that will take time and will not be instant. Even if you put the whole DB in RAM it will be fast, but I can still imagine a bunch of queries that will simply have to take a couple of minutes anyway.

Basically correct. If you have less than tens-of-terabytes the big exciting stuff just isn't needed and isn't particularly helpful.

Focus on understanding your data and building a useful model.

Use Postgres if you can. Supplement with DuckDB and/or ClickHouse as needed. Or, use your cloud's columnar DB (Redshift, BQ, Azure whatever) because you can start and stop using it as you please without talking to sales or signing a contract.

If your data team doesn't have the requisite skills... well... consider that tech selection might not be your problem.

Modifying the argument, I'd say even the companies that do need Snowflake or Databricks for an analytical use case tend to get run over by the hype train post-adoption and start using it everywhere as an anti-pattern, causing much bigger problems operationally.
100%.

as an organization, there's value in _limiting_ the number of tools you adopt. But you'll nearly always want to adopt a few right-sized tools rather than using the same sledgehammer for every single thing that looks remotely like a nail.

While I do agree that for most companies those large SaaS solutions are overkill, I do not think that DuckDB or similar is sufficient. Nowadays more companies really do need to process large datasets.

I regularly meet companies that used PostgreSQL or something similar up to some point, but then they grew and it is not sufficient anymore. They need something scalable. It does not have to be a large SaaS: in many cases a small ClickHouse cluster is sufficient. Nevertheless, not everything can be done using a single server. Also, even if a customer knows exactly what their needs are right now, those needs will grow and change over time, so it is reasonable to build something that is not only good enough for now. Of course, building something absolutely "future proof" leads to extremes and high bills.

I would turn this on its head and say that most companies do need one or the other.

My first premise is that most companies will have a BI or Data and Analytics problem, whether it’s analysing their spend, revenue, operations, customer churn or something more interesting.

At that point, having an industry standard, fully managed, fully elastic and resilient platform with consumption based pricing sounds pretty appealing.

Yes I can run and administer a warehouse on EC2, but the total cost of manpower and servers with full resilience is going to be high, especially as you’ll have to add in analytics tools, ETL tools etc which Snowflake or Databricks might have included.

I’m a huge believer in both Snowflake and Databricks. Snowflake for BI and Databricks for anything more funky. The technology is on point and the business case stacks up for the most part.

The dirty secret of modern data management is that very few people really have big enough data to justify major amounts of data infrastructure.
My client is in the middle of migrating their Snowflake data to Databricks Unity Catalog.

To this day, I still do not understand why they love Databricks so much. It just looks like a Jupyter notebook to me. I know, it has Spark. So?

I’m trying to learn more so I can build something better, maybe.

Yeah, I'd like to know too, from everything I can see it's just a cloud-hosted notebook with a spark backend, which does have some benefits, but I can't tell why they got as big as they did.

Also from a more cynical side, I've interviewed at a lot of companies that ask if I've used it before, and I'm really tempted to just lie, since it feels like something that could be picked up really easily.

99.99% of companies are not Google sized and don't need Google's solutions. Period.
Google-sized solutions tend to create the need for google sized solutions though.

Take a basic CRUD TODO app. Add metrics to every user interaction, every mouse movement. Now you need a time series database, Kafka and a data lake. Break it down into microservices. Now you need Kubernetes, and also structured logging, and probably an ELK stack. You also probably need Grafana to monitor your services. Because you have introduced hundreds of failure points, you also need high availability, multiple regions, elastic scaling. Eventual consistency.

Thinking about it, this may be a corollary of Parkinson's law.
It definitely is. It also sort of applies to legacy enterprise, too, in a different flavor: "If the tool or system is easy to use and accessible (think classic LAMP stack), then shadow IT is guaranteed to exist and start using it." The more modern flavor is that "if a cloud service exists, then your enterprise architecture will expand to consume it [whether it's necessary or not]."
As a data engineer who's worked in a bunch of different contexts/companies I 100% agree most of the time snowflake/databricks is an unnecessary money sink. The main problem is most companies need the security of a managed service for cloud computing, and don't want to be locked out of scaling to a very large scale (with distributed compute). Unfortunately, I don't think there are a lot of options that meet those pretty simple requirements that aren't databricks or snowflake[1].

Sure, you could put your compute on single docker containers, but when your data gets too big you're stuck and have to get someone expensive in to manage Kubernetes, and all that time you can't compute on your data. Which is sorta the crux of it: Databricks and Snowflake are expensive, but not nearly as expensive as finding out you need them and not having them.

[1] if you're using Python on aws/gcp, I think coiled (coiled.io) is a rare exception to that pattern

I think your second point is vastly overstated for 99% of companies and their use cases. There are precious few situations that can't be handled decently well by a simple RDBMS and a simple ETL process into an analytical data store for reporting. Almost everything that exists fits one of three buckets: 1) CRUD app, 2) reporting tool, 3) business process app. And almost all of this is for internal use only. It's really only a handful of enterprise things that need massive scalability and performance, and consumer-oriented stuff (much of which is pretty niche and also doesn't need massive global scalability).
I think you're right and I probably came across a bit strong about how likely companies would be to need that level of scale. The point I was trying to make was more that because something like an RDBMS (maybe plus VMs or something for data science) has a hard ceiling in terms of scale, even if companies are really unlikely to hit that ceiling, there is a weird incentive structure.

The costs of Databricks or Snowflake are really high, but probably not enough to bankrupt a company or division. If you're a small to midsize company and you realise you can't compute on your data, that could cause much more serious issues. Even if those issues only materialize 1 in 100 times, it's a pretty scary risk to expect a CTO to ignore.

Considering what you get, I think Snowflake is rather decently priced. It’s not easy to integrate and operate a bunch of open source tools that replace it, especially not on a small scale. Also, in most cases you’ll pay a lot more for the people that use Snowflake than you do for Snowflake, so it makes more sense to focus on productivity.
Why not give in and pay for the SQL Server Hyperscale instance? Isn't there enough BS to worry about? Why continue to waste time on tired old OLTP/OLAP/scalability/etc. conversations in 2023?

Unless your business can continuously write >100 megabytes of transactional data per second, this solution would almost certainly address all of your needs forever and ever. Up to 100TB too. It just works. It offers transactions exactly the way most business expect them to be conducted. No weird code, no weird client libraries, nothing. It works more or less like it has for the last 30+ years.

I can tell you for a fact that simply setting up a gigantic shared database called "company" and getting the team connected to it did wonders for us. When you stop worrying about "will it scale" you can start to collaborate and do amazing things again.

If you do magically hit the "unicorn vector" and 100x your customer-base then you can start worrying about "will it scale", and thankfully you will have new revenue that can be deployed in the form of new hardware and new hires.
We make a product that smaller companies use; it’s much simpler to use and no consultants needed. We notice that even in bigger corps, departments import into our system and use that because they find the de facto systems painful to use and they don’t actually need that much power, ever.
Word. Most companies don't have nearly enough data to begin to justify them.
Which use cases cannot do without Snowflake or Databricks?
Any typical analytics use case, where a single dataset (think a single table) is less than 2TB and the total data size is less than 10-20 TB in the organization. If that's you, you very likely don't need Snowflake and definitely not Databricks. An RDBMS would suffice, I recommend Postgres.

These sizes are just my thumb rules. Your mileage may vary.

I have people in my network use Snowflake with a total data size of 50GB!!! It makes my blood boil.

years ago, had this one college professor, students would spend the time before class asking him to divide two fractional numbers, and he would rattle off a few decimal places after calculating, to cheers from the students that had their calculators out with the answers. the excitement would build with every correct decimal point.

he didn’t actually need a calculator.

but he did own one.

hopefully this does not cause your blood to go through a state change.

Am I right in understanding that all these companies are actually making deals for on-demand use in the future?
you can get started with many of these companies with just a credit card with monthly billing. just like the major cloud vendors.
> I have people in my network use Snowflake with a total data size of 50GB!!! It makes my blood boil.

I keep a whole bunch of small-ish datasets in Snowflake because it means I don't have to do all the work to janitor my own postgres install. I go with the db someone else maintains for me that I have, not the one that I wish I had.

Zero-copy, zero-etl data sharing.
Most people shouldn't need a 3GHz 8-core CPU in their pocket to look at cats or check weather, but here we are. Simplicity by bloat comes at a price.
This article has valid points but does not understand the perspective of companies. Companies do not buy technologies. Companies buy solutions.

- Companies do not buy Spark, they buy the ability to process their data and to have multiple personas collaborate (data scientists, data engineers, ...)

- You can do it yourself. It will be cheaper but it will require time, expertise and money, all things that companies do not give easily

- Snowflake and Databricks are elastic: you can start small and grow as you need. This is much easier than justifying the upfront cost of hiring specialized people or asking for trust that your ad-hoc solution will respect whatever enterprise governance rules

(disclaimer: I worked at Databricks for 6 years and talked to hundreds of prospect and actual Databricks users and customers)

> The global economy is headed for a recession. That’s not my opinion, that’s the Federal Reserve’s

I know that's not the topic of the post but OP shows a lack of comprehension here. The Fed warned because that's what they do when the risk is non-negligible. They never say it is definitely going to happen because nobody knows.

I would consider following things before selecting a database offering:

- Whether the database vendor is a lock-in. It would be a straight "NO" for me if the database isn't open source with a proper license, since I couldn't self-host it in case I need to move away from their SaaS offering for various unforeseen reasons: the vendor decided to increase pricing, the vendor has hidden pricing, the vendor has reliability issues, etc.

- How big the community behind the database is. Check their public forums to understand how the community feels about the database and how their requirements are considered.

- Don't believe any random benchmarking post online; do my own benchmarking for my use case.

- Check which other companies have adopted that database and read their experiences.

For the non data experts, from elsewhere on medium: “ Semi-structured data tends to evolve over time. Systems that generate data add new columns to accommodate additional information, which requires downstream tables to evolve accordingly. The structure of tables in Snowflake can evolve automatically to support the structure of new data received from the data sources”

“When dealing with large datasets, the processing power of individual machines can become limiting, necessitating the use of distributed and parallel processing capabilities provided by platforms such as Databricks“

That's common practice with mid sized companies and above, though:

- Find something that can probably give some competitive advantage or please shareholders
- Identify it as "not part of our core business"
- Pay a third-party company to do it even if it would be cheaper to do things in house. If something goes wrong and the shareholders ask questions, the CEO can blame them instead of their own IT department and reassure them they're "focusing on what generates most value for the company".

i am not sure why everyone (comments, this post, etc.) assumes there is a one-size-fits-all solution to every problem, even this one that looks quite simple. companies have to align a few things, ranging from the skills of the current employees to hiring plans, investor/shareholder management, or how to make sure the CEO really gets that boat he really wants and deserves.

not all businesses are the same. businesses with fat contracts but few users won't have massive operating costs, and they can use whichever easy and non-scalable technology they want, because once the business scales up, those fat contracts will pay for enough data engineers. a gaming startup will face high server costs right away, without any optimisation, while data platforms (e.g. bigquery) with a tiny bit of optimisation (materialising 2-3 summary tables, for example) will bring the cost down to "laughable" pretty easily.

it is true that many of these things are choices, e.g. do you really want to spend a shit ton of money for looker when superset for most users is just as good? are you even able to make that choice? if these choices are hard to make because a potential user (or set of users) in the company really wants something instead of something else, well, that is not a technical choice, and the issue you have has nothing to do with the technology.

The thing is, you don't understand that a business needs to stay focused and that has a price. When you buy those solutions, you are paying for the price of staying focused.

If the price for staying focused is $X, and paying it meets a real need with a positive business impact, as Snowflake or Databricks can, then it is worth paying.

And it doesn't mean you won't need to spend on HR, developers, management and so on to run your own OSS solution.

It is actually great to give Snowflake and/or Databricks your money. It's a really expensive service, with a huge markup. That I understand.

But the alternative doesn't look good at all. A company is better staying focused.

Also, if the company sees that it spends a considerable amount of its budget on Snowflake/Databricks, it will find solutions: that could be negotiating with them, figuring out how to optimize their use, etc.

I've worked at and seen many companies optimize their cost structure, run away from Datadog/New Relic or Snowflake/Databricks, and just fail miserably.

The alternatives made developers simply use them less, and also spend more time to do the same things they did before.

They even sometimes call it successful, execs point out the money saved, but they don't see the subjective part: developers being less efficient, wasting their time with BS.

There's a long tail of costs related to inefficient systems: wasted developer hours, and more people that you need to manage, hire, train...

Not to forget the burden of firing or making people redundant; companies end up losing so much of their culture and soul doing such things. And this is typically what you need to do if you take on those expenses of running things for yourself.

The ideal company pays for many SaaS platforms and stays focused on its core business; if it can make money, great!

There is so much literature on cost optimization in business in general, stuff that has been studied for over a century. If people would read up on that, they wouldn't be repeating such nonsense as this article.

Also, the article says that you are OVERPAYING for it if you aren't a Fortune 100. It's much more like the opposite: if you have a small company with a small amount of data, it's very likely that Snowflake or Databricks won't cost you that much.

Based also on the other comments on HN, I can only conclude that this is one of those articles that is so full of mistakes and plainly wrong, that there isn't even a real debate over here, just people saying how wrong it is.

This reminds me of something I've seen too -- companies paying for enterprise wordpress (>$25k a year) when they could easily have say a $100 wordpress plan with all the features they need, behind free or cheap cloudflare.
No one needs to evaluate all options against their current/known/possible needs when they can just pick a well vetted product and get on with business.
Most companies don't use Snowflake or Databricks
Agreed, not completely on duckdb but we used it for consolidating billing data from 10+ ERP systems and it works, so I see his point. Just to add to his points:

- Integrations are still one of the hardest things in enterprise IT. Snowflake/Databricks/etc. in fact add to the number of systems to integrate; they make this problem worse most of the time

- Governance in a self-service data ecosystem gets complicated fast, especially if you need to stay compliant with data privacy, GDPR, etc. And amazingly, again, neither Snowflake nor Databricks solves this. In fact, they make it worse by sucking budget away from governance initiatives

I guess the ETL orchestration was the hard part? How did you do it?
Connectivity was the hardest part; we had to write Python connectors for a variety of ERP systems. We have a one-server setup where we run DuckDB and Python scripts; we monitor and orchestrate with Prefect, but you can use one of the many Python orchestration tools. We load the finished data marts into an MS SQL server and users connect to it via Power BI or Excel.
I don't understand why you would use DuckDB as an intermediate step to fill an MS SQL DW. Surely you could just go to MS SQL directly?
I'm sure DuckDB was used to transform/prepare the data from the ERP extract format to an MSSQL ingestion format.

There are plenty of arguments and reasons why you would use DuckDB to do this esp if you're preparing the data for Analytical/OLAP use-cases.

Perhaps a more relevant question might be why they didn't use DataFactory or some other ETL tool/service. DuckDB is rising to the occasion for these kinds of use-cases though.

Let me give you some bullets on DataFactory (DF) because it is a question I get a lot.

- DF is quite hard to operationalize, the logging is not so good, Python stack traces are easier to debug and logging can get as detailed as needed

- DF lacks connectors for data ingestion, this is easy in Python as on average a custom connector takes a week or two to develop

- DF is not data pipelines as code and it is becoming really hard to manage governance and change management on UI based ETL tools

- It is hard to enforce best practices on DF. We are finding it is easy to enforce standard ways of writing and managing SQL models and metadata with a combo of dbt and a dbt-ready data catalog

73.6% of all statistics are made up
9/10 dentists prefer snowflake.
Just like the statistics in the title; in reality that number is vastly higher.
