Every company that has ignored the following advice has experienced a day-for-day slip in first-quarter scheduling. The advice: not much work gets done between Dec 15 and Jan 15. You can rely on a week’s worth; more than that is optimistic. People are taking it easy, and they need to verify things with someone who is on vacation, so they are blocked. And when that person gets back, it’s two days until their vacation, so it’s a crapshoot.
NB: there’s work happening on Jan 10, for certain, but it’s not getting finished until the 15th. People are often still cleaning up after bad decisions they made during the holidays and the subsequent hangover.
Canary deployment, testing environments, unit tests, integration tests, anything really?
It sounds like they test by merging directly to production, but surely they don't.
It's still a bit silly though. Their claimed reasoning probably doesn't stack up for most of their config changes: I doubt that a 0.1->1->10->100 rollout over a period of 10 minutes would be a catastrophically bad idea for them for _most_ changes.
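To make the 0.1->1->10->100 idea concrete, here is a minimal sketch of a staged rollout that stops at the first failed health check. Everything in it is an assumption for illustration: the names `Stage`, `plan_rollout`, and `run_rollout` are hypothetical, the stages are spread evenly over the 10 minutes, and the health check is a stand-in callback. It is not how Cloudflare deploys anything.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    percent: float       # share of the fleet that receives the change
    start_minute: float  # minutes after the rollout begins


def plan_rollout(percents: List[float], total_minutes: float) -> List[Stage]:
    """Spread the stages evenly across the rollout window (an assumption)."""
    step = total_minutes / len(percents)
    return [Stage(p, i * step) for i, p in enumerate(percents)]


def run_rollout(
    stages: List[Stage], is_healthy: Callable[[float], bool]
) -> Tuple[bool, float]:
    """Apply each stage in order and stop at the first unhealthy one.

    Returns (completed, max_exposure_percent): whether every stage passed,
    and the largest share of the fleet that ever saw the change.
    """
    exposure = 0.0
    for stage in stages:
        exposure = stage.percent          # the change is now live at this share
        if not is_healthy(stage.percent): # hypothetical health probe
            return False, exposure        # halt here; rollback would follow
    return True, exposure


plan = plan_rollout([0.1, 1, 10, 100], total_minutes=10)

# A change that only misbehaves once it reaches 1% of the fleet:
completed, exposure = run_rollout(plan, lambda pct: pct < 1)
```

With these assumptions the stages start at minutes 0, 2.5, 5 and 7.5, and the bad change in the example is stopped with 1% of the fleet exposed instead of 100%.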
And to their credit, it does seem they want to change that.
A key part of secure systems is availability...
It really looks like vibe-coding.
It's never right to leave structural issues even if "they don't happen under normal conditions".
Military hardware is produced with engineering design practices that look nothing at all like what most of the HN crowd is used to. There is an extraordinary amount of documentation, requirements, and validation done for everything.
There is a MIL-SPEC for pop tarts which defines all parts sizes, tolerances, etc.
Unlike much of the software world, military hardware gets DONE with design, and then they just manufacture it.
I realise this may boggle the mind of the modern software developer.
They're going to see "oh, it leaks 3MiB per minute… and this system runs for twice as long as the old system", and then they're going to think for five seconds, copy-paste the appropriate paragraph, double the memory requirements in the new system's paperwork, and call it a day.
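The five seconds of thinking amounts to something like the arithmetic below. This is a rough sketch: the 3 MiB per minute figure is from the comment, while the 60-minute runtime of the old system and the 512 MiB baseline are made-up numbers, and the function names are hypothetical.

```python
def leak_budget_mib(leak_mib_per_min: float, runtime_min: float) -> float:
    """Memory consumed by a constant-rate leak over one run."""
    return leak_mib_per_min * runtime_min


def required_memory_mib(
    baseline_mib: float, leak_mib_per_min: float, runtime_min: float
) -> float:
    """Working-set baseline plus whatever the leak eats before shutdown."""
    return baseline_mib + leak_budget_mib(leak_mib_per_min, runtime_min)


LEAK_RATE = 3          # MiB per minute, from the comment above
OLD_RUNTIME_MIN = 60   # hypothetical runtime of the old system
BASELINE_MIB = 512     # hypothetical non-leaking working set

old_leak = leak_budget_mib(LEAK_RATE, OLD_RUNTIME_MIN)        # 180 MiB
new_leak = leak_budget_mib(LEAK_RATE, 2 * OLD_RUNTIME_MIN)    # 360 MiB
new_total = required_memory_mib(BASELINE_MIB, LEAK_RATE, 2 * OLD_RUNTIME_MIN)
```

Because the leak is linear in runtime, doubling the runtime doubles the leak budget, which is why updating the paperwork is a copy-paste job and not a redesign.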
Checklists work.
I won’t remember this block of code because five other people have touched it. So I need to be able to see what has changed and what it talks to, so I can quickly verify whether my old assumptions still hold true.
Well obviously not, because the front fell off. That’s a dead giveaway.
When talking of their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks with what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.