
Services usually depend on databases. Libraries usually don’t. You either need to support every storage backend your users might have, require them to write an integration layer against your generic hooks, or expect them to provision and manage new storage just for your library. In any case you are asking them to do a lot more work (managing the data) and, in some sense, breaking encapsulation by making them responsible for it.

It's only an unreasonable amount of work if you assume that the user is managing a separate storage backend for each library. If you take the Tim Berners-Lee approach (re: https://solidproject.org/) then each user is only managing one storage backend: the one that stores their data. The marginal cost of hooking in one more library to the existing backend is low.

We just have to get a little more fed up with all of these services and then the initial cost of setting it up in the first place will be worth it. Any day now...

I think most interesting web services are providing structured access to the same data for multiple people. A private, individual data silo wouldn't get the job done unless combined with some kind of message-passing. A silo to which users can invite peers is interesting, but it's an important characteristic of many web services that the specific read and write transactions allowed are application-defined... you don't actually want to give your collaborators general read or write access at the storage level.

For example, it's important that I can add this comment, and I can't delete your comment, but the moderators can. The "storage" software would have to know something about the business logic of a web forum to make that happen.
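
To make that concrete, here's a minimal sketch (names and rules are invented for illustration, not any real forum's code) of the kind of application-defined rule a generic storage layer can't express on its own:

```python
# A minimal sketch, with made-up names: who may delete a comment is business
# logic the application defines, not something a storage layer knows about.
from dataclasses import dataclass

@dataclass
class Comment:
    id: int
    author: str

def can_delete(user: str, comment: Comment, moderators: set[str]) -> bool:
    # Authors may delete their own comments; moderators may delete any.
    return user == comment.author or user in moderators

moderators = {"alice"}
c = Comment(id=1, author="bob")
assert can_delete("bob", c, moderators)        # I can delete my own comment
assert not can_delete("carol", c, moderators)  # you can't delete mine
assert can_delete("alice", c, moderators)      # but a moderator can
```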

I think we just need smarter browsers which can be configured to know who we trust in which dimension.

If I want to leave a comment on an article and then delete it, I can publish the comment in my pod, and I can also publish the deletion. If you've got your browser in a mode where it's interested in my comments, it can pull in the data from my pod and render it in context with the article--whether or not the article's author cared to provide a comments section.
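
As a rough sketch of that idea (plain JSON records with made-up URLs and field names, not actual Solid vocabulary), the pod could simply hold the comment and a later tombstone for it:

```python
# Hypothetical records in "my pod": a comment, then a record revoking it.
import datetime
import json

now = datetime.datetime.now(datetime.timezone.utc).isoformat()

comment = {
    "type": "Comment",
    "id": "https://alice.example/pod/comments/42",     # where the comment lives
    "inReplyTo": "https://news.example/articles/123",  # the article it refers to
    "body": "Nice article!",
    "published": now,
}

deletion = {
    "type": "Delete",
    "object": comment["id"],  # tombstone pointing at the earlier record
    "published": now,
}

# A browser following this pod would fetch both records and simply stop
# rendering the comment; the article's site is never involved.
print(json.dumps([comment, deletion], indent=2))
```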

If you drop the idea that anyone is authoritative about how it all comes together on the viewer's screen, you can also dispense with the headaches of being that authority (e.g. services).

This shows that we lack good abstractions over storage.
Cloud providers are (counter to intuition about their lock-in incentives) improving this space. Lots of tools now let you configure storage by just pointing at various cloud stores, most often via an S3-compatible API, though not exclusively.
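
For instance (a sketch assuming boto3; the endpoint, bucket, and credentials are placeholders), "pointing at an S3-compatible store" usually amounts to little more than swapping the endpoint URL:

```python
# Sketch only: the same client code talks to AWS S3, MinIO, Ceph RGW, etc.,
# because they all expose an S3-compatible API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.com",  # swap per provider
    aws_access_key_id="PLACEHOLDER",
    aws_secret_access_key="PLACEHOLDER",
)

s3.put_object(Bucket="my-bucket", Key="backups/db.tar.gz", Body=b"...")
data = s3.get_object(Bucket="my-bucket", Key="backups/db.tar.gz")["Body"].read()
```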

K8s PersistentVolume is another decent shot at storage abstraction, if a bit raw.

Finally, more and more tools expect you to have a Postgres they can plug into as a backend.

All of the above assumes you want to treat the library's data as a big unknown blob. Once data starts getting corrupted and needs bespoke repair, things are less fun. Access and retention are another fun rabbit hole.

Data is complicated.

It’s more that data storage needs are not one-size-fits-all, so it’s better left to the user, who knows their storage needs best.

This is interesting; it makes me wonder if a "dockerised" database is something people could use. I mean a database frontend with its own language/protocols/whatever that allows you to define the data structure but leaves the specific storage engine or format as a backend detail that can change from platform to platform.
> I mean a database frontend with its own language/protocols/whatever that allows you to define the data structure but leaves the specific storage engine or format as a backend detail that can change from platform to platform.

That's more or less a description of SQL.
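
A small sketch of that reading of SQL (table and values are made up): the statements are the "frontend", and the engine behind them is a deployment detail.

```python
# The same SQL text runs against an embedded engine here; pointed at a
# Postgres connection instead, the statements stay the same (modulo
# driver-specific placeholder syntax for parameters).
import sqlite3

DDL = "CREATE TABLE IF NOT EXISTS notes (body TEXT)"
INSERT = "INSERT INTO notes (body) VALUES ('hello')"
SELECT = "SELECT body FROM notes"

conn = sqlite3.connect(":memory:")  # storage engine: embedded SQLite
conn.execute(DDL)
conn.execute(INSERT)
print(conn.execute(SELECT).fetchall())

# With e.g. psycopg the same statements would be sent to a Postgres server:
#   import psycopg
#   with psycopg.connect("dbname=app") as pg:
#       pg.execute(DDL); pg.execute(INSERT)
```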

The Postgres wire protocol is indirectly getting there. Plenty of tools speak it with wildly different storage engines on the other end.

A clean-room implementation would likely yield different results, but there appears to be some appetite for a solution.

Nah. It's not that. We lack a concept that can organize storage. Let me illustrate this.

So, until some years ago there was complete nonsense and anarchy in Linux networking management. That is, until we got the "ip" program. There's still nonsense and anarchy, because the "ip" program doesn't cover everything, but it's on the right track to organize everything Linux knows about networking under one roof. So vendors today, like, say, Mellanox (i.e. NVIDIA), choose to interface with "ip" and work with that stack rather than invent their own interfaces.

When it's extendable in predictable and convenient ways, users will extend and enrich its functionality.

Now, compare this to Linux storage... I want to scream and kill somebody every time I have to deal with any aspect of it because of how badly mismanaged it is. There's no uniformity, plenty of standards where at most one is necessary, duplication upon duplication, layers... well, forget layers. Like, say, you wanted a RAID0: well, you have MD RAIDs, you have LVM RAIDs, you have ZFS RAIDs, you have multipathing with DM (is that a RAID? well, sorta depends on what you expected...); Ceph RBD is also kind of like a RAID, and DRBD can also sort of be like a RAID...

Do you maybe also want snapshots? How about encryption? -- Every solution will end up so particularly tailored to the needs of your organization that even an experienced admin in the very area your org specializes in will have no clue what's going on with your storage.

Needs can be studied, understood, catalogued, and rolled into some sort of hierarchy or some other structure amenable to management. We haven't solved this problem. But we have an even bigger one: no coordination and no desire to coordinate even within Linux core components, let alone third-party vendors.

You cannot abstract away a three-order-of-magnitude difference in bandwidth and latency.

Where did you get a three-order-of-magnitude difference? Are you still using hard drives for your storage medium?

Adding two numbers together takes on the order of a nanosecond. Doing the same thing through a REST/HTTP service (like an idiot) in the same datacenter takes on the order of a millisecond. Six orders of magnitude, actually.

I'm pretty sure the parent comment was about storage media, not about network hops and service boundaries. Also, extremely basic REST/HTTP services definitely do not take 1 millisecond even if you are bad at software - the overhead of that stack is in the tens of microseconds if you are doing nothing.

For the comparison being referenced here, if you want to compare RAM, the storage medium that backs compute, to modern persistent storage, here it is:

* 40 GB/s per DIMM vs 5-10 GB/s per NVMe SSD. At most one order of magnitude off, but you can pack enough disks into a computer that the throughput ratio is almost 1:1. AWS EBS is about 1 order of magnitude different here, and that is with network-attached storage.

* 100-200 ns latency (RAM) vs 10-50 us (fast SSD) - about 2 orders of magnitude, but also possible to hide with batching.
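
Back-of-the-envelope arithmetic behind those figures (the numbers are the assumed ballpark values from the comments above, not measurements):

```python
import math

ram_bw, nvme_bw = 40e9, 8e9        # bytes/s: per DIMM vs per NVMe SSD
ram_lat, ssd_lat = 150e-9, 30e-6   # seconds: RAM vs fast SSD access
add_lat, http_lat = 1e-9, 1e-3     # seconds: CPU add vs in-datacenter HTTP call

print(round(math.log10(ram_bw / nvme_bw), 1))   # ~0.7 orders of magnitude
print(round(math.log10(ssd_lat / ram_lat), 1))  # ~2.3 orders of magnitude
print(round(math.log10(http_lat / add_lat), 1)) # 6.0 -> the "six orders" claim
```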

The space is complex enough that I wonder if it's possible to make abstractions that aren't horribly leaky.

I blame the SQL "standard". It's a massive, unnecessary abstraction layer that only complicates attempts to build bridges between code and relational databases (which I believe is the most general-purpose paradigm).

Personally, I am working on a modern Python ORM for PostgreSQL and PostgreSQL alone.
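
Purely as an illustration of that direction (and not the commenter's actual project), targeting one database lets the mapping layer lean on Postgres-specific SQL instead of papering over every dialect:

```python
# Hypothetical sketch: generate Postgres-flavoured SQL (RETURNING, JSONB)
# straight from a dataclass. Names and conventions are invented here.
from dataclasses import dataclass, fields

@dataclass
class Account:
    id: int
    name: str
    settings: dict  # would map to a JSONB column

def insert_sql(obj) -> str:
    cols = [f.name for f in fields(obj) if f.name != "id"]
    params = ", ".join(f"%({c})s" for c in cols)  # psycopg-style named params
    table = type(obj).__name__.lower()
    return f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({params}) RETURNING id"

print(insert_sql(Account(id=0, name="alice", settings={"theme": "dark"})))
# INSERT INTO account (name, settings) VALUES (%(name)s, %(settings)s) RETURNING id
```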

Isn't this what SQL is supposed to be? You bring the DBI for your database and plug it into the app. Shame that it doesn't work out so well in practice.

But that abstraction exists because SQL exists.

If the library is designed to send SQL to another storage library...

The title of the article literally says "where possible". You found a case where it's not possible and decided to argue against that...

No, not all services come connected with a database. Alternatively, oftentimes a database is an artifact of tenancy and the need to manage users, which would not be needed had the functionality been exposed as a library.

More importantly, whether users realize this or not, a library is more beneficial for them than a service in the majority of cases. Much in the same way that it's almost always better to own something than to rent it.

Just to give some examples of the above: all the Internet-of-crap stuff, all sorts of "smart" home nonsense which requires that you subscribe to a service, install an app on your phone, and send all your private data unsupervised to some shady Joe Schmo you know nothing about. To be more specific, take something like the Google Nest thermostat. There's no reason this contraption should ever go on the Internet, nor should it need to know your street address, your email, etc. In fact, the utility it brings is very marginal (it saves you the few steps you'd have to take to reach the boiler's controls to program it). It absolutely could've been designed in such a way that it doesn't connect to the Internet, or at least never leaves the local area network, and yet it's a cloud service...

I'm maybe naive, but is it not possible to supply a repository interface for the user to implement? Bring your own glue?

The library uses only the interface to work with whatever ORM/DB connector exists in the client project.

If services at any given company all use a standard DB library, it could even interface directly, assuming you're using that. I don't think we're talking about public APIs and packages here.

The underlying point of the post seems to be that it is better to ask the user to do more work than the developer.
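
A minimal sketch of that "bring your own glue" shape (interface and names are illustrative, not from any particular library): the library codes against a small repository protocol, and the client project supplies an implementation backed by whatever storage it already runs.

```python
from typing import Optional, Protocol

class KeyValueRepository(Protocol):
    """The only storage surface the library depends on."""
    def get(self, key: str) -> Optional[bytes]: ...
    def put(self, key: str, value: bytes) -> None: ...

class InMemoryRepository:
    """Simplest possible client-side glue; could just as well wrap SQLite,
    Postgres, or S3."""
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)
    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

def library_feature(repo: KeyValueRepository, name: str) -> bytes:
    # The library only talks to the interface; where the bytes live is the
    # client's decision.
    cached = repo.get(name)
    if cached is None:
        cached = f"computed:{name}".encode()
        repo.put(name, cached)
    return cached

print(library_feature(InMemoryRepository(), "report"))
```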

Great point. In my opinion it is possible, and maybe even ideal, to do both: make it easy for anyone to run their own service while also running your own service, so that users have the option to not have to manage the data, patching, and ops side.

sqlite is a library

zeromq is a library

That’s all the storage you need

No? If you have a horizontally scaled architecture or anything with multiple nodes, you can't just get away with "sqlite is a library".

SQLite is more about letting you get away with not having a horizontally scaled architecture or anything with multiple nodes in the first place.

SQLite alone doesn't help you get away with any of that. If you already know that a single machine is enough for you, then sure, SQLite is a fine choice. Availability needs alone often force you to run multi-node setups.

I'd argue that Salesforce could run on SQLite (the library).

Almost all software is multitenant / easily shardable in some way. And almost all software can easily run on a single machine.

Most private software is, but there's an awfully large amount of publicly served software that can't fit into this model (also, any software that has network effects, like Twitter).
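
A rough sketch of the "multitenant / easily shardable" case above (paths and schema are made up): one SQLite file per tenant, which keeps each tenant's data independent and trivially movable between machines.

```python
import sqlite3
from pathlib import Path

DATA_DIR = Path("tenants")

def tenant_db(tenant_id: str) -> sqlite3.Connection:
    """Open (and lazily initialise) the database file for one tenant."""
    DATA_DIR.mkdir(exist_ok=True)
    conn = sqlite3.connect(str(DATA_DIR / f"{tenant_id}.db"))
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, payload TEXT)")
    return conn

# Each tenant only ever touches its own file; rebalancing tenants across
# machines is a file copy, not a distributed-database migration.
with tenant_db("acme") as conn:
    conn.execute("INSERT INTO events VALUES (datetime('now'), 'signup')")
```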
