
I think most companies don't understand a large part of the Databricks offering and it should be used by way more organizations. Disclaimer: I was a Databricks user for 6 years and now work at Databricks.

Yeah, you can create your own Spark deployment, but it will run much slower than the Databricks Runtime (DBR) or Databricks' proprietary engine (Photon). Computations that run slower mean a larger cloud compute bill. Databricks rewrote the Spark execution engine in C++ (Photon); it runs really fast and saves a lot on EC2 compute.

> Define when you should compact files, when to Z-order

Or skip these issues entirely and use autocompaction or the new Liquid Clustering. These are exactly the kinds of problems the platform should solve, so the user has time to focus on business logic.
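For context, "define when you should compact" on a DIY lakehouse usually means hand-rolling a rule like the one below. This is a toy sketch with invented thresholds, not Databricks' actual autocompaction logic:

```python
# Illustrative small-file compaction heuristic for a DIY lakehouse.
# The thresholds here are made up for the example; autocompaction and
# Liquid Clustering exist so you never have to tune numbers like these.

TARGET_FILE_BYTES = 128 * 1024 * 1024  # common Parquet target size
SMALL_FILE_BYTES = 16 * 1024 * 1024    # "small file" cutoff
MIN_SMALL_FILES = 8                    # how many small files before we act

def should_compact(file_sizes):
    """Compact a partition once it accumulates many small files
    (the classic symptom of streaming micro-batch writes)."""
    small = [s for s in file_sizes if s < SMALL_FILE_BYTES]
    return len(small) >= MIN_SMALL_FILES

# Ten tiny micro-batch files -> compact; three full-size files -> leave alone.
print(should_compact([4 * 1024 * 1024] * 10))   # True
print(should_compact([TARGET_FILE_BYTES] * 3))  # False
```

And this sketch is the easy part: a real DIY setup also has to schedule the compaction job, handle concurrent writers, and re-derive the thresholds as data volumes change.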

> If you can sniff out the inefficiencies in your Data early and make architecture that handles your specific data

I don't know what this means.

Are you going to build a deep learning model to make reads/writes faster, like Databricks predictive I/O? https://docs.databricks.com/en/optimizations/predictive-io.h.... Probably not; you have a lot of business problems to solve.

> Do the real work. Work with people. The Code will write itself.

I've seen lots of DIY data platforms. They're horrible to work with and I can assure you that the code does not write itself. The data engineers have a lot less time to write code because they're constantly trying to stand the platform back up.


> Are you going to build a deep learning model to make read/writes faster like Databricks predictive I/O?

It makes sense for Databricks to do this because they're building a product that needs to work for a high cardinality of datasets. Using a DNN for that is defensible because the input shapes are practically infinite. But for individual orgs, it seems much more likely that simple heuristics-driven access pattern optimizations can be done without throwing ML at the problem (though I'll say the predictive I/O concept is a cool one; I've done similar ML work for network traffic QoS).
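To make the heuristics-vs-ML point concrete, here's a toy rule-based access-pattern optimizer of the kind a single org with a known workload might use instead of a model. The class name, threshold, and partition keys are all invented for illustration:

```python
from collections import Counter

class HotPartitionTracker:
    """Toy heuristic: flag partitions read at least `hot_threshold`
    times as candidates for caching/prefetching. A stand-in for the
    simple rules that often suffice when you know your own workload;
    nothing here resembles Databricks' actual predictive I/O."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold
        self.reads = Counter()

    def record_read(self, partition):
        self.reads[partition] += 1

    def hot_partitions(self):
        return {p for p, n in self.reads.items() if n >= self.hot_threshold}

tracker = HotPartitionTracker()
for p in ["2024-01", "2024-01", "2024-01", "2024-02"]:
    tracker.record_read(p)
print(tracker.hot_partitions())  # {'2024-01'}
```

A rule like this is trivially debuggable and tunable, which is exactly the appeal over a learned model when your dataset shapes aren't "practically infinite."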

> I've seen lots of DIY data platforms. They're horrible to work with and I can assure you that the code does not write itself.

This one made me scratch my head a bit, because I have seen many DIY platforms as well over a 20 year career. Most of them have been amazing to work with. The code didn't write itself of course, but maintenance burdens were low and specialization/expertise within the org was high as a result. On top of that, I didn't need to argue with an AE about rising costs on a per-annum basis whenever renewal time comes around.

I point this out not in the service of coloring your observations as wrong or misguided, but to highlight that there's going to be a spectrum of varied lived experiences people have with the build-vs-buy conundrum. I expect we'll never see genuine consensus on this issue as a result of that variance (and maybe we don't have to).

I begrudgingly agree. We tried to stand up a Spark cluster, and there are just loads of configuration details and little niggles that get in the way, like making sure everything on your cluster has the libraries it needs.

Putting it on Databricks, most of those problems went away and we could just get on with writing the code.

Could we have figured out how to get Spark working in a bespoke configuration? Almost certainly. By our mid-September deadline? Probably not.
