- I really enjoyed this post and love seeing more lightweight approaches! The deep dive on tradeoffs between different durable-execution approaches was great. For me, the most interesting part is Persistasaurus's (cool name btw) use of bytecode generation via ByteBuddy, which is a clever way to improve DX: it can transparently intercept step functions and capture execution state without requiring explicit API calls.
(Disclosure: I work on DBOS [1].) The author's point about the friction of explicit step wrappers is fair: we don't use bytecode generation today, but we're actively exploring it to improve DX.
- I'm excited about this because durable workflows are really important for making AI applications production-ready :) Disclosure: I'm working on DBOS, a durable workflow library built on Postgres, which looks complementary to this.
I asked their main developer Dillon about the data/durability layer and the compilation step. I wonder if adding a "DBOS World" would be feasible. That way, you'd get Postgres-backed durable workflows, queues, messaging, streams, etc. all in one package, while the "use workflow" interface remains the same.
Here is the response from Dillon, and I hope it's useful for the discussion here:
> "The primary datastore is dynamodb and is designed to scale to support tens of thousands of v0 size tenants running hundreds of thousands of concurrent workflows and steps."
> "That being said, you don't need to use Vercel as a backend to use the workflow SDK - we have created a interface for anyone to implements called 'World' that you can use any tech stack for https://github.com/vercel/workflow/blob/main/packages/world/..."
> "you will require a compiler step as that's what picks up 'use workflow' and 'use step` and applies source transformations. The node.js run time limitations only apply to the outer wrapper function w/ `use workflow`"
- We're seeing issues with multiple AWS services https://health.aws.amazon.com/health/status
- In DBOS, workflows can be invoked directly as normal function calls or enqueued. Direct calls don't require any polling. For queued workflows, each process runs a lightweight polling thread that checks for new work using `SELECT ... FOR UPDATE SKIP LOCKED` with exponential backoff to prevent contention, so many concurrent workers can poll efficiently. We recently wrote a blog post on durable workflows, queues, and optimizations: https://www.dbos.dev/blog/why-postgres-durable-execution
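To make the polling model concrete, here's a minimal sketch of a skip-locked dequeue loop with exponential backoff. This is not DBOS's actual implementation; the table and column names are made up, and it assumes the psycopg2 driver:

```python
import time
import psycopg2  # assumes the psycopg2 Postgres driver is installed

def poll_queue(dsn: str, max_backoff: float = 5.0):
    """Claim enqueued workflows with SKIP LOCKED, backing off when the queue is empty."""
    conn = psycopg2.connect(dsn)
    backoff = 0.1
    while True:
        with conn:  # one transaction per claim attempt
            with conn.cursor() as cur:
                cur.execute(
                    """
                    SELECT workflow_id FROM workflow_queue
                    WHERE status = 'ENQUEUED'
                    ORDER BY created_at
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                    """
                )
                row = cur.fetchone()
                if row:
                    cur.execute(
                        "UPDATE workflow_queue SET status = 'PENDING' WHERE workflow_id = %s",
                        (row[0],),
                    )
        if row:
            backoff = 0.1  # found work: reset the backoff
            print(f"claimed workflow {row[0]}")  # hand off to a worker here
        else:
            time.sleep(backoff)  # empty queue: back off exponentially
            backoff = min(backoff * 2, max_backoff)
```

Because `SKIP LOCKED` makes each worker skip rows another worker has already locked, many pollers can run against the same queue table without blocking each other.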
Throughput mainly comes down to database writes: executing a workflow = 2 writes (input + output), each step = 1 write. A single Postgres instance can typically handle thousands of writes per second, and a larger one can handle tens of thousands (or even more, depending on your workload size). If you need more capacity, you can shard your app across multiple Postgres servers.
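As a rough back-of-the-envelope example using those numbers: a workflow with 5 steps costs 7 writes (2 for the workflow plus 1 per step), so a Postgres instance sustaining 10,000 writes per second could in principle checkpoint on the order of 1,400 such workflows per second.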
- Good questions!
DBOS naturally scales to distributed environments, with many processes/servers per application and many applications running together. The key idea is to use the database's concurrency control to coordinate multiple processes. [1]
When a DBOS workflow starts, it's tagged with the version of the application process that launched it. This way, you can safely change workflow code without breaking existing workflows: they'll continue running on the older version. As a result, rolling updates become easy and safe. [2]
[1] https://docs.dbos.dev/architecture#using-dbos-in-a-distribut...
[2] https://docs.dbos.dev/architecture#application-and-workflow-...
- I think one potential concern with "checkpoint execution state at every interaction with the outside world" is the size of the checkpoints. Allowing users to control the granularity by explicitly specifying the scope of each step seems like a more flexible model. For example, you can group multiple external interactions into a single step and only checkpoint the final result, avoiding the overhead of saving intermediate data. If you want finer granularity, you can instead declare each external interaction as its own step.
Plus, if the crash happens in the outside world (where you have no control), then checkpointing at finer granularity won't help.
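Here's a rough sketch of what that choice looks like with step decorators (illustrative endpoints and function names; it assumes the `requests` library and the Python decorator API shown elsewhere in this thread):

```python
import requests
from dbos import DBOS

@DBOS.step()
def enrich_user(user_id: str) -> dict:
    # Coarse granularity: two external calls grouped into one step, so only
    # the combined result is checkpointed, never the intermediate profile.
    profile = requests.get(f"https://api.example.com/users/{user_id}").json()
    score = requests.get(f"https://api.example.com/scores/{user_id}").json()
    return {"profile": profile, "score": score}

@DBOS.step()
def fetch_profile(user_id: str) -> dict:
    # Fine granularity: this single external call is its own step and checkpoint.
    return requests.get(f"https://api.example.com/users/{user_id}").json()
```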
- I think a clearer way to think about this is: "at least once" message delivery plus idempotent workflow execution is effectively exactly-once event processing.
The DBOS workflow execution itself is idempotent (assuming each step is idempotent). When DBOS starts a workflow, the "start" (workflow inputs) is durably logged first. If the app crashes, on restart DBOS reloads from Postgres and resumes from the last completed step. Steps are checkpointed so they don't re-run once recorded.
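A minimal sketch (not DBOS's actual internals) of why checkpointed steps make replay effectively exactly-once: a recorded step is returned from the log instead of being re-executed. Table and column names are made up, and placeholders are psycopg2-style:

```python
import json

def run_step(cur, workflow_id: str, step_id: int, fn, *args):
    # 1. If this step already has a recorded output, return it instead of re-running.
    cur.execute(
        "SELECT output FROM step_outputs WHERE workflow_id = %s AND step_id = %s",
        (workflow_id, step_id),
    )
    row = cur.fetchone()
    if row is not None:
        return json.loads(row[0])
    # 2. Otherwise run the step and durably record its output before moving on.
    result = fn(*args)
    cur.execute(
        "INSERT INTO step_outputs (workflow_id, step_id, output) VALUES (%s, %s, %s)",
        (workflow_id, step_id, json.dumps(result)),
    )
    return result
```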
- That password is only used by the GHA to start a local Postgres Docker container (https://github.com/dbos-inc/dbos-transact-golang/blob/main/c...), which is not accessible from outside.
- I think it was likely caused by the cache trying to compare the tag with Docker Hub: https://docs.docker.com/docker-hub/image-library/mirror/#wha...
> "When a pull is attempted with a tag, the Registry checks the remote to ensure if it has the latest version of the requested content. Otherwise, it fetches and caches the latest content."
So if the authentication service is down, it might also affect the caching service.
- The main advantage is the same architectural benefit DBOS provides in other languages: you only need to deploy your application, so there's no separate coordinator to run. All functionality (checkpointing, durable queues, notification/signaling, etc) is built directly into the Go package on top of the database.
- Those are great questions!
For versioning, we recommend keeping each version running until all workflows on that version are done. It's similar to a blue-green deployment: each process is tagged with one version, and all workflows in it share that version. You can list pending/enqueued workflows on the old version (via the UI or the list_workflow programmatic API), and once that list drains, you can shut down the old processes. DBOS Cloud automates this, and we'll add more guidance for self-hosting.
For bugfixes, DBOS supports programmatic forking and other workflow management tools [1]. We deliberately don't support code patching because it's fragile and hard to test. For example, patches can pile up on long-running workflows and make debugging painful.
The main limit is the database itself (whose size you control). DBOS writes workflow inputs, step outputs, and workflow outputs to it. There's no step limit beyond disk space. Postgres/SQLite allow up to 1 GB per field, but keeping inputs/outputs under ~2 MB helps performance. We'll add clearer guidelines to the docs.
Thanks again for all the thoughtful questions!
[1] https://docs.dbos.dev/python/reference/contexts#fork_workflo...
- Thanks for sharing your insights! You nailed the key tradeoffs of most durable workflow systems. The callback-style programming model is exactly the pain point we aim to solve with DBOS.
Instead of forcing you into a custom async runtime, DBOS lets you keep writing normal functions (this is an example in Python):

```python
@DBOS.workflow()
def do_thing(foo):
    return bar

# You can still call the workflow function like this:
result = do_thing(fooInput)
```

Under the hood, DBOS checkpoints inputs/outputs so it can recover after failure, but you don't have to restructure your code around callbacks. In Python and Java we use decorators/annotations so registration feels natural, while in Go/TypeScript there's a lightweight one-time registration step. Either way, you keep the synchronous call style you'd expect.

On top of that, DBOS also supports running workflows asynchronously or through queues, so you can start with a simple function call and later scale out to async/queued execution without changing your code. That's what the article was leading into.
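For a sense of what that looks like, here's a rough sketch of asynchronous and queued invocation of the same function (going from memory of the Python API's `DBOS.start_workflow` and `Queue` helpers; check the docs for exact signatures, and the queue name is made up):

```python
from dbos import DBOS, Queue

# Start the workflow in the background and fetch its result later.
handle = DBOS.start_workflow(do_thing, fooInput)
result = handle.get_result()

# Or enqueue it, letting any worker polling this queue pick it up durably.
queue = Queue("example_queue")  # hypothetical queue name
handle = queue.enqueue(do_thing, fooInput)
result = handle.get_result()
```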
- I've been building an integration [1] with Pydantic AI and the experience has been great. Questions usually get answered within a few hours, and the team is super responsive and supportive of external contributors. The public API is easy to extend with new functionality (in my case, durable agents).
Its agent model feels similar to OpenAI's: flexible and dynamic without needing to predefine a DAG. Execution is automatically traced and can be exported to Logfire, which makes observability pretty smooth too. Looking forward to their upcoming V1 release.
Shameless plug: I've been working on a DBOS [2] integration for Pydantic AI as a lightweight durable agent solution.
- Yeah, we plan to add more languages. DBOS currently supports Python and TypeScript, and Go and Java will be released soon. We're having a preview of DBOS Java at our user group meeting on August 28: https://lu.ma/8rqv5o5z You're welcome to join us! We'd love to hear your feedback.
We welcome community contributions to the open source repos.
- Managing complex scheduled workflows at scale comes with a lot of nuances. This is exactly why we're building DBOS (shameless plug! https://github.com/dbos-inc), which provides durable cron jobs and exactly-once workflow triggering. Since it's just a library on top of Postgres, it doesn't require a centralized scheduler (well, think of Postgres as the coordinator).
One challenge is guaranteeing exactly-once processing across software upgrades. DBOS uses the cron-scheduled time as an idempotency key and tags each workflow execution with a version. We also use database transactions to guard against conflicting concurrent updates.
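For reference, a minimal sketch of a durable cron job in the Python library (the crontab string and body are illustrative, and this is from memory of the `@DBOS.scheduled` decorator, whose handler receives the scheduled and actual firing times):

```python
from datetime import datetime
from dbos import DBOS

@DBOS.scheduled("*/5 * * * *")  # hypothetical schedule: every 5 minutes
@DBOS.workflow()
def cleanup_job(scheduled_time: datetime, actual_time: datetime):
    # scheduled_time doubles as the idempotency key, so each firing is
    # processed exactly once even across restarts and version upgrades.
    DBOS.logger.info(f"Running cleanup scheduled for {scheduled_time}")
```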
> During replay, your code runs from the beginning but skips over completed checkpoints, using stored results instead of re-executing completed operations. This replay mechanism ensures consistency while enabling long-running executions. > > ... During replay, your code runs from the beginning but skips over completed checkpoints, using stored results instead of re-executing completed operations. This replay mechanism ensures consistency while enabling long-running executions.