The problem is that Cloudflare do incremental rollouts and loads of testing for _code_. But they don't do the same thing for configuration - they globally push out changes because they want rapid response.
It's still a bit silly though. Their claimed reasoning probably doesn't stack up for most of their config changes - I doubt that a 0.1% -> 1% -> 10% -> 100% rollout over a period of 10 minutes would be a catastrophically bad idea for _most_ changes.
And to their credit, it does seem they want to change that.
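For what it's worth, a staged config rollout like the 0.1 -> 1 -> 10 -> 100 one mentioned above can be sketched in a few lines. This is only a rough illustration under my own assumptions, not how Cloudflare actually deploys anything: `apply_to_fraction`, `is_healthy` and `rollback` are hypothetical hooks you would have to supply, and the 150-second soak per stage is just 10 minutes split four ways.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    apply_to_fraction: Callable[[float], None],  # hypothetical: push config to this share of the fleet
    is_healthy: Callable[[], bool],              # hypothetical: e.g. error rate below some threshold
    rollback: Callable[[], None],                # hypothetical: restore the previous config everywhere
    stages: Sequence[float] = (0.001, 0.01, 0.10, 1.0),  # 0.1% -> 1% -> 10% -> 100%
    soak_seconds: float = 150.0,                 # assumed: 4 stages * 150 s = 10 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Push a change stage by stage; roll back at the first unhealthy stage."""
    for fraction in stages:
        apply_to_fraction(fraction)
        sleep(soak_seconds)  # let errors surface before widening the blast radius
        if not is_healthy():
            rollback()
            return False
    return True
```

The point is just that a bad change gets caught while it affects 0.1% of traffic rather than all of it, at the cost of a few minutes of delay.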
Yeah, to me it doesn't make any sense - configuration changes are just as likely to break stuff (as they've discovered the hard way), and both of these issues could have been found in a testing environment before being deployed to production.
In the post they say they observed errors in their testing env, but decided to ignore them because they were rolling out a security fix. I am sure there is more nuance to this, but I don’t know whether that makes it better or worse.
> but decided to ignore them because they were rolling out a security fix.
A key part of secure systems is availability...
It really looks like vibe-coding.
Canary deployment, testing environments, unit tests, integration tests, anything really?
It sounds like they test by merging directly to production, but surely they don't.