
Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.

The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.

Good reminder that you are only as strong as your weakest link.


This reminds me of the time Google’s Paris data center flooded and caught fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in a nearby AWS EU datacenter, and it just so happened that the DNS resolver for our Google services elsewhere was hosted in Paris (or more accurately, it routed to Paris first because it was the closest). The temp fix was pretty fun: that was the day I found out that /etc/hosts of deployments can be globally modified in Kubernetes easily, AND that it was compelling enough to actually want to do it. Normally you would never want an /etc/hosts entry controlling routing in kube like this, but this temporary kludge shim was the perfect level of abstraction for the problem at hand.
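
For the curious: Kubernetes exposes this as `hostAliases` on the pod spec, which gets rendered into each container's /etc/hosts. A rough sketch of that kind of break-glass patch, using the official Python client; the namespace, hostname, and IP below are made up for illustration, not what we actually used:

```python
# Rough sketch only: assumes the official `kubernetes` Python client; the
# namespace, hostname, and IP below are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# hostAliases entries are written into each container's /etc/hosts,
# so this pins the affected hostname to a known-good IP cluster-wide.
patch = {
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {"ip": "203.0.113.10", "hostnames": ["storage.googleapis.com"]}
                ]
            }
        }
    }
}

# Apply the patch to every deployment in the namespace.
for dep in apps.list_namespaced_deployment(namespace="prod").items:
    apps.patch_namespaced_deployment(
        name=dep.metadata.name, namespace="prod", body=patch
    )
```

Patching the pod template does roll every deployment, which is the price of the shortcut, but in a break-glass situation that's usually an acceptable trade.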
> temporary kludge shim was the perfect level of abstraction for the problem at hand.

That's some nice manager-deactivating jargon.

"Manager-deactivating jargon" is a great phrase - it’s broadly applicable and also specific.
Yeah, that sentence betrays my BigCorp experience - it’s pulling from the corporate bullshit generator for sure.
+1...hee hee
Couldn't you just patch your coredns deployment to specify different forwarders?
Probably. This was years ago so the details have faded, but I do recall that we weighed about six different valid approaches of varying complexity in the war room before deciding this /etc/hosts hack was the right one for our situation.
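
If we'd gone the CoreDNS route instead, the change would presumably have been a rewrite of the `forward` line in the Corefile. A hypothetical sketch only - the ConfigMap name, Corefile contents, and upstream IPs here are illustrative and vary by distribution; this is not what we ran:

```python
# Hypothetical sketch of the CoreDNS-forwarders alternative. Assumes a stock
# cluster where the Corefile lives in the `coredns` ConfigMap in kube-system;
# the upstream resolver IPs are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

corefile = """.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # forward to explicit resolvers instead of the node's /etc/resolv.conf
    forward . 1.1.1.1 8.8.8.8
    cache 30
    loop
    reload
    loadbalance
}
"""

# Overwrite the Corefile in place; CoreDNS reads it from this ConfigMap.
core.patch_namespaced_config_map(
    name="coredns",
    namespace="kube-system",
    body={"data": {"Corefile": corefile}},
)
```

With the `reload` plugin in the Corefile, CoreDNS picks up the ConfigMap change on its own after a short delay; otherwise you bounce the coredns pods.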
I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.
Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?
Wishful thinking, but I hope an engineer somewhere got to ram a door down to fix a global outage. For the stories.
Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.

On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk in and mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".

> So security was basically "does someone else recognize you?"

I actually can't think of a more secure protocol. Doesn't scale, though.

Way back when DCs were secure but not _that_ secure, I social-engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.

/those were the days

I was in a datacenter when the fire alarm went off and all door locks were automatically disabled.
That sounds like an Equinix datacenter. They were painfully slow at 350 E. Cermak.
It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.
The story was that they had to use an angle grinder to get in.
I remember hearing that Google, early in its history, had some sort of emergency backup codes that they encased in concrete to prevent them from becoming a casual part of the process, and that they needed a jackhammer and a couple of hours when the supposedly impossible happened after only a couple of years.
Louvre gang decides they can make more money contracting to AWS.
The data center I’m familiar with uses cards and biometrics, but every door also has a standard key override. Not sure who opens the safe with the keys, but that’s the fallback in case the electronic locks fail.
I prefer to use a sawzall and just go through the wall.
I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lasers to get to the one button that would stop the nuclear missile launch.

Add a bunch of other pointless sci-fi and evil villain lair tropes in as well...

Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.

Still have my "my other datacenter is made of razorblades and hate" sticker. \o/

Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.

Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.

Sometimes a little good old-fashioned mayhem is good for employee morale.
Every good firefighter knows this feeling.

Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.

P.S. Don’t park in front of fire hydrants, because they will have a shit-eating grin on their face when they destroy your car - ahem - clear the obstacle - when they need to use it to stop a fire.

Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.

I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.

"Meta Data Center Simulator 2021: As Real As It Gets (TM)"

Yes, for some insane reason Facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad, because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.
Depends. Some have a paranoid mode without caching, because then a physical attacker can't snip a cable and then use a stolen keycard as easily, or something. We had an audit force us to disable caching, which promptly went south during a power outage two months later, when the electricians couldn't get into the switch room anymore. The door was easy to overcome, however - just a little fiddling with a credit card, no heroic hydraulic press story ;)
Auditors made you disable credential caching but missed the door that could be shimmed open...
Sounds like they earned their fee!
If you aren't going to cache locally, then you need redundant access to the server (like LTE) and a plan for unlocking the doors if you lose access to the server.
This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if parts of AWS depend on Dynamo under the hood, it should be a walled-off instance, separate from the Dynamo available via us-east-1.
There should be many more, smaller instances with a smaller blast radius.
Yep. And their internal comms were on the same server if memory serves. They were also down.
I was there at the time; for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.

Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.

I remember working for a company that insisted all teams had to use whatever corp instant messaging/chat app, but our sysadmin + network team maintained a Jabber server plus a bunch of core documentation synchronized on a VPS in totally different infrastructure, just in case, and sure enough there was a day it came in handy.
AWS, for the ultimate backup, relies on a phone call bridge on the public phone network.
Thanks for the correction, that sounds right. I thought I had remembered IRC but wasn't sure.
That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.

Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!

That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching each other (even if it's a competitor phone network).
That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.
Are you asserting that Rogers employees needed documentation to know that Rogers Wireless runs on Rogers systems?
Rogers is perhaps best described as a confederacy of independent acquisitions. In working with their sales team, I have had to tell them where their facilities are, as the sales engineers don't always know about all of the assets that Rogers owns.

There's also the insistence that Rogers employees should use Rogers services. Paying for every Rogers employee to have a Bell cell phone would not sit well with their executives.

The fact that the risk assessments of the changes being made to the router configuration were incorrect also contributed to the outage.

There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.
So sick of billion-dollar companies not hiring that one more guy.
That is perhaps why they are billion dollar companies and why my company is not very successful.
> Identity Center and only put it in us-east-1

Is it possible to have it in multiple regions? Last I checked, it only accepted one region. You needed to remove it first if you wanted to move it.

Security people and ignoring resiliency and failure modes: a tale as old as time
Correct. That does make it a centralized failure mode and everyone is in the same boat on that.

I’m unaware of any common and popular distributed IDAM that is reliable

Not sure if this counts fully as 'distributed' here, but we (Authentik Security) help many companies self-host authentik multi-region, or across private cloud + on-prem, to allow for quick IAM failover and more reliability than IAMaaS.

There's also "identity orchestration" tools like Strata that let you use multiple IdPs in multiple clouds, but then your new weakest link is the orchestration platform.

Disclosure: I work for FusionAuth, a competitor of Authentik.

Curious. Is your solution active-active or active-passive? We've implemented multi-region active-passive CIAM/IAM in our hosted solution[0]. We've found that meets the needs of many of our clients.

I'm only aware of one CIAM solution that seems to have active-active: Ory. And even then I think they shard the user data[1].

0: https://fusionauth.io/docs/get-started/run-in-the-cloud/disa...

1: https://www.ory.com/blog/global-identity-and-access-manageme... is the only doc I've found and it's a bit vague, tbh.

Hey Dan, appreciate the discussion!

Ory’s setup is indeed true multi-region active-active; not just sharded or active-passive failover. Each region runs a full stack capable of handling both read and write operations, with global data consistency and locality guarantees.

We’ll soon publish a case study with a customer that uses this setup that goes deeper into how Ory handles multi-region deployments in production (latency, data residency, and HA patterns). It’ll include some of the technical details missing from that earlier blog post you linked. Keep an eye out!

There are also some details mentioned here: https://www.ory.com/blog/personal-data-storage

> I’m unaware of any common and popular distributed IDAM that is reliable

Other clouds, lmao. Same requirements, not the same mistakes. Source: worked for several, one a direct competitor.

Wow, you really *have* to exercise the region failover to know if it works, eh? And that confidence gets weaker the longer it’s been since the last failover I imagine too. Thanks for sharing what you learned.
You should assume it will not work unless you test it regularly. That's a big part of why having active/active multi-region is attractive, even though it's much more complex.
That most likely wouldn't have even caught this, unless they verified they had no incidental tie-ins with us-east-1.
The last place I worked actively switched traffic over to the backup nodes regularly (at least monthly) to ensure we could do it when necessary.

We learned that lesson by having to do emergency failovers and having some problems. :)

For what it's worth, we were unable to log in with root credentials anyway.

I don't think any method of auth was working for accessing the AWS console.

Sure it was, you just needed to log in to the console via a different regional endpoint. No problems accessing systems from ap-southeast-2 for us during this entire event, just couldn’t access the management planes that are hosted exclusively in us-east-1.
Like the other poster said, you need to use a different region. The default region (of course) sends you to us-east-1.

e.g. https://us-east-2.console.aws.amazon.com/console/home

It's a good reminder, actually, that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and is seamless.
If you don’t regularly restore a backup, you don’t have one.
Too much armor makes you immobile. Will your security org be taken to task for this? This should permanently slow down all of their future initiatives, because it’s clear they have been running “faster than possible” for some time.

Who watches the watchers.

Totally ridiculous that AWS wouldn't by default make it multi-region and warn you heavily that your multi-region service is tied to a single region for identity.

The usability of AWS is so poor.

They don’t charge anything for Identity Center and so it’s not considered an important priority for the revenue counters.
I always find it interesting how many large enterprises have all these DR guidelines but fail to ever test them. Glad to hear that everything came back alright.
People will continue to purchase Multi-AZ and multi-region even though you have proved what a scam it is. If the east region goes down, ALL of Amazon goes down; feel free to change my mind. STOP paying double rates for multi-region.
Sounds like a lot of companies need to update their BCP after this incident.
"If you're able to do your job, InfoSec isn't doing theirs"
