I did end up moving it Redis and basically ZADD an execution timestamp and job ID, then ZRANGEBYSCORE at my desired interval and remove those jobs as I successfully distribute them out to workers. I then set a fence time. At that time a job runs to move stuff that should have ran but didn’t (rare, thankfully) into a remediation queue, and load the next block of items that should run between now + fence. At the service level, any items with a scheduled date within the fence gets ZADDed after being inserted into the normal database. Anything outside the fence will be picked up at the appropriate time.
This worked. I was able to ramp up the polling time to get near-real time dispatch while also noticeably reducing costs. Problems were some occasional Redis issues (OOM and having to either a keep bumping up the Redis instance size or reduce the fence duration), allowing multiple pollers for redundancy and scale (I used schelock for that :/), and occasionally a bug where the poller craps out in the middle of the Redis work resulting in at least once SLA which required downstream protections to make sure I don’t send the same message multiple time to the patient.
Again, it all works but I’m interested in seeing if there are solutions that I don’t have to hand roll.
1. Functions which allow you to declaratively sleep until a specific time, automatically rescheduling jobs (https://www.inngest.com/docs/reference/functions/step-sleep-...).
2. Declarative cancellation, which allows you to cancel jobs if the user reschedules their appointment automatically (https://www.inngest.com/docs/guides/cancel-running-functions).
3. General reliability and API access.
Inngest does that for you, but again — disclaimer, I made it and am biased.