I’ve always concurred with the Helland/Kleppmann observation mentioned here, viz. that the transaction log of a typical RDBMS is the canonical form and the rows and tables are merely projections.

It’s curious that over those projections we then build event stores for CQRS/ES systems, ledgers, etc., with their own projections mediated by application code.

But look underneath too. The journaled filesystem on which the database resides also has a log representation, and under that, a modern SSD uses an adaptive log structure to balance block writes.

It’s been a long time since we wrote an application event stream linearly straight to media, and although I appreciate the separate concerns that each of these layers addresses, I’d probably struggle to justify them all from first principles to even a slightly more Socratic version of myself.


This is similar to the observation that memory semantics in most any powerful machine since the late 60s are implemented using messaging, and then applications go ahead and build messaging out of memory semantics. Or the more general observation that every layer of information exchange tends towards implementing packet switching if there's sufficient budget (power/performance/cost) to support doing so.
> It’s curious that over those projections we then build event stores for CQRS/ES systems, ledgers, etc., with their own projections mediated by application code.

The database only supports CRUD. So while the CDC stream is the truth, it's very low level. We build higher-level event types (as in event sourcing) for the same reason we build any higher-level abstraction: it gives us a language in which to talk about business rules. Kleppmann makes this point in his book and it was something of an aha moment for me.
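
To make that concrete, a rough sketch (table and event names invented here): one business-level event, versus the row-level changes the transaction log records for the same action.

    -- A business-level event, stored event-sourcing style:
    CREATE TABLE events (
        id          bigserial   PRIMARY KEY,
        type        text        NOT NULL,   -- e.g. 'OrderPlaced'
        payload     jsonb       NOT NULL,
        occurred_at timestamptz NOT NULL DEFAULT now()
    );

    INSERT INTO events (type, payload)
    VALUES ('OrderPlaced', '{"order_id": 123, "sku": "A-7", "qty": 2}');

    -- The CDC stream for the same action sees only the row-level fan-out:
    --   INSERT INTO orders ...
    --   INSERT INTO order_lines ...
    --   UPDATE inventory SET on_hand = on_hand - 2 WHERE sku = 'A-7'

'OrderPlaced' carries the business intent; the three row changes by themselves don't.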

I'm sorry, but have you ever actually used a database before? A database supports FAR more than "only CRUD". Some really simple examples are CTEs, SELECT ... INTO (or INSERT ... SELECT in some dialects), triggers, views, etc.
CTEs are extensions that try to regain some of what was lost when adopting a model loosely based on Codd's declarative relational algebra, specifically the lack of transitive closure; the rest mostly fits into the CRUD world.
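
The canonical example is a bill-of-materials query, which needs arbitrary-depth transitive closure (table and column names invented):

    -- All parts contained, directly or indirectly, in assembly 42:
    WITH RECURSIVE reachable(part_id) AS (
        SELECT child_id FROM assemblies WHERE parent_id = 42
      UNION
        SELECT a.child_id
        FROM assemblies a
        JOIN reachable r ON a.parent_id = r.part_id
    )
    SELECT part_id FROM reachable;

No fixed number of self-joins can express this; the recursion does the work the original algebra can't.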

It is a bit circular: CRUD's elements (create, read, update, and delete) were chosen to represent the core features of a persistence layer.

What you mention is a high-level projection over the transaction log, and it is subject to transaction isolation levels:

https://www.postgresql.org/docs/current/transaction-iso.html
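
A sketch of what that means in practice (PostgreSQL syntax; the ledger table is hypothetical):

    -- Under READ COMMITTED each statement sees the latest committed data;
    -- under REPEATABLE READ the whole transaction reads one snapshot:
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT sum(amount) FROM ledger;  -- snapshot established here
    -- rows committed by other sessions from now on are invisible...
    SELECT sum(amount) FROM ledger;  -- ...so this returns the same answer
    COMMIT;

Two sessions reading the "same" projection of the log at the same moment can legitimately see different states.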

Have you completely missed the context of this thread? We're talking about the transaction log aka CDC. This consists purely of CRUD operations because that's all the database understands. Maybe read the books cited?
We're a very short step away from a "hardware" database.
The table data in the database is the canonical form. You can delete the transaction logs and only temporarily lose some reliability; it is very common to delete transaction logs when they are not needed. When databases are backed up, they either dump the logical data or take a snapshot of the data, and can then stream the transaction logs for syncing or backup until the next checkpoint.
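
That's essentially PostgreSQL's physical backup model, for what it's worth (paths illustrative):

    -- Periodic snapshot:  pg_basebackup -D /backup/base
    -- Stream the log until the next snapshot (postgresql.conf):
    --     archive_mode = on
    --     archive_command = 'cp %p /backup/wal/%f'
    -- Restore = latest base backup + WAL replay up to the chosen point.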

I'm pretty sure journalled filesystems recycle the journal. There are log-structured filesystems, but they aren't used much beyond low-level flash.

Sorry, this is mistaking the operational for the fundamental.

If a transaction log is replayed, then an identical set of relations will be obtained. Ergo, the log is the prime form of the database.

It’s that simple.

At a very abstract level, maybe. But it's common not to log changes that can trivially be rolled back, like insertions into a table that was created or truncated within the transaction. Of course, such optimizations are incompatible with log-based replication. So the statement should probably be, “in a system with log-based replication, the log is authoritative, and the tables are just an optimization”. This framing also avoids ambiguities because a transaction log may not be fully serialized, and might not fully determine table contents.
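
PostgreSQL makes this concrete: with wal_level = minimal, a bulk load into a table created or truncated in the same transaction can skip the WAL entirely, with durability coming from fsyncing the relation file at commit. A sketch (table and file invented):

    BEGIN;
    CREATE TABLE staging (k text, v text);  -- new in this transaction
    COPY staging FROM '/tmp/batch.csv' WITH (FORMAT csv);  -- may bypass the WAL
    COMMIT;  -- the relation file is fsync'ed instead of logged row by row
    -- A standby replaying the log would never see these rows, which is
    -- why wal_level = minimal rules out log-based replication.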
At work we need to distribute daily changes to a dataset, so we have a series of daily deltas. If a new client is brought up, they need to apply all the deltas to get the current dataset.

This is time-consuming, so we optimized it by creating "base versions" every month. So a client only needs to download the latest base version and then apply the deltas since then...
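
If each delta row carries a version and a tombstone flag, the base version can be built in one query, since only the newest write per key matters. A sketch with an invented schema, not necessarily how our real one looks:

    -- base_v1(key, value); deltas(version, key, value, deleted)
    CREATE TABLE base_v2 AS
    SELECT key, value
    FROM (
        SELECT DISTINCT ON (key) key, value, deleted
        FROM (
            SELECT 0 AS version, key, value, false AS deleted FROM base_v1
            UNION ALL
            SELECT version, key, value, deleted FROM deltas
            WHERE version <= 42              -- the new cutoff version
        ) merged
        ORDER BY key, version DESC           -- newest write per key wins
    ) newest
    WHERE NOT deleted;                       -- drop tombstoned keys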

Which is what accountants call "closing the books". Once all ledgers have been reconciled, old ledgers can be archived and you go forward from the last closing.

Forensic accounting, incidentally, is when something went badly wrong and outside accountants have to go back through the old ledgers, and maybe old invoices and payments and reconstruct the books. FTX had to do that after the bankruptcy to find out where the money went and where it was supposed to go.

The transaction log maintained from time 0 would be equivalent but too expensive to store compared to the tables.
If you relax your constraint to "retain logs for the past N days", you can accumulate the logs from T=0 to T=(today - N) into tables (a snapshot at the cutoff) and still keep the replayable log from that cutoff onwards.
On the contrary, I’ve known plenty of sites that keep their logs.

Often written to tape, for obvious reasons.

Conversely, given a database, you can't (in general) reconstruct the specific transaction log that resulted in it. You can reconstruct some log, but it's not uniquely defined and is missing a lot of potentially relevant information.
