
Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.

The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.

Good reminder that you are only as strong as your weakest link.


This reminds me of the time Google’s Paris data center flooded and caught fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in a nearby AWS EU datacenter, and it just so happened that the DNS resolver for our Google services elsewhere was hosted in Paris (or more accurately, it routed to Paris first because it was the closest). The temp fix was pretty fun: that was the day I found out that /etc/hosts of deployments can be globally modified in Kubernetes easily, AND that it was compelling enough to actually want to do it. Normally you would never want an /etc/hosts entry controlling routing in kube like this, but this temporary kludge shim was the perfect level of abstraction for the problem at hand.
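
For the curious: Kubernetes exposes this as `hostAliases` on the pod spec, which gets rendered into each container's /etc/hosts. A rough sketch of that kind of break-glass patch, using the official Python client; the namespace, hostname, and IP below are made up for illustration, not what we actually used:

```python
# Rough sketch only: assumes the official `kubernetes` Python client; the
# namespace, hostname, and IP below are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# hostAliases entries are written into each container's /etc/hosts,
# so this pins the affected hostname to a known-good IP cluster-wide.
patch = {
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {"ip": "203.0.113.10", "hostnames": ["storage.googleapis.com"]}
                ]
            }
        }
    }
}

# Apply the patch to every deployment in the namespace.
for dep in apps.list_namespaced_deployment(namespace="prod").items:
    apps.patch_namespaced_deployment(
        name=dep.metadata.name, namespace="prod", body=patch
    )
```

Patching the pod template does roll every deployment, which is the price of the shortcut, but in a break-glass situation that's usually an acceptable trade.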
> temporary kludge shim was the perfect level of abstraction for the problem at hand.

That's some nice manager-deactivating jargon.

"Manager-deactivating jargon" is a great phrase - it’s broadly applicable and also specific.
Yeah, that sentence betrays my BigCorp experience - it’s pulling from the corporate bullshit generator for sure.
+1...hee hee
Couldn't you just patch your coredns deployment to specify different forwarders?
Probably. This was years ago so the details have faded, but I do recall that we weighed about six different valid approaches of varying complexity in the war room before deciding this /etc/hosts hack was the right one for our situation.
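
If we'd gone the CoreDNS route instead, the change would presumably have been a rewrite of the `forward` line in the Corefile. A hypothetical sketch only - the ConfigMap name, Corefile contents, and upstream IPs here are illustrative and vary by distribution; this is not what we ran:

```python
# Hypothetical sketch of the CoreDNS-forwarders alternative. Assumes a stock
# cluster where the Corefile lives in the `coredns` ConfigMap in kube-system;
# the upstream resolver IPs are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

corefile = """.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # forward to explicit resolvers instead of the node's /etc/resolv.conf
    forward . 1.1.1.1 8.8.8.8
    cache 30
    loop
    reload
    loadbalance
}
"""

# Overwrite the Corefile in place; CoreDNS reads it from this ConfigMap.
core.patch_namespaced_config_map(
    name="coredns",
    namespace="kube-system",
    body={"data": {"Corefile": corefile}},
)
```

With the `reload` plugin in the Corefile, CoreDNS picks up the ConfigMap change on its own after a short delay; otherwise you bounce the coredns pods.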
I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.
Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?
Wishful thinking, but I hope an engineer somewhere got to ram a door down to fix a global outage. For the stories.
Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.

On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk in and mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".

> So security was basically "does someone else recognize you?"

I actually can't think of a more secure protocol. Doesn't scale, though.

Way back when DCs were secure but not _that_ secure, I social-engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.

/those were the days

I was in a datacenter when the fire alarm went off and all door locks were automatically disabled.
That sounds like an Equinix datacenter. They were painfully slow at 350 E. Cermak.
It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.
The story was that they had to use an angle grinder to get in.
I remember hearing that Google, early in its history, had some sort of emergency backup codes that they encased in concrete to prevent them from becoming a casual part of the process, and that they needed a jackhammer and a couple of hours when the supposedly impossible happened after only a couple of years.
Louvre gang decides they can make more money contracting to AWS.
The data center I’m familiar with uses cards and biometrics, but every door also has a standard key override. Not sure who opens the safe with the keys, but that’s the fallback in case the electronic locks fail.
I prefer to use a sawzall and just go through the wall.
I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lasers to get to the one button that would stop the nuclear missile launch.

Add a bunch of other pointless sci-fi and evil villain lair tropes in as well...

Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.

Still have my "my other datacenter is made of razorblades and hate" sticker. \o/

Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.

Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.

Sometimes a little good old-fashioned mayhem is good for employee morale.
Every good firefighter knows this feeling.

Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.

P.S. Don’t park in front of fire hydrants, because they will have a shit-eating grin on their face when they destroy your car - ahem - clear the obstacle - when they need to use it to stop a fire.

Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.

I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.

"Meta Data Center Simulator 2021: As Real As It Gets (TM)"

Yes, for some insane reason Facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad, because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.
Depends. Some have a paranoid mode without caching, because then a physical attacker can't snip a cable and then use a stolen keycard as easily, or something. We had an audit force us to disable caching, which promptly went south during a power outage two months later, when the electricians couldn't get into the switch room anymore. The door was easy to overcome, however - just a little fiddling with a credit card, no heroic hydraulic press story ;)
Auditors made you disable credential caching but missed the door that could be shimmed open...
Sounds like they earned their fee!
If you aren't going to cache locally, then you need redundant access to the server (like LTE) and a plan for unlocking the doors if you lose access to the server.
This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if parts of AWS depend on Dynamo under the hood, it should be a walled-off instance, separate from the Dynamo available via us-east-1.
There should be many more, smaller instances with a smaller blast radius.
Yep. And their internal comms were on the same server if memory serves. They were also down.
I was there at the time; for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.

Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.

I remember working for a company that insisted all teams had to use whatever corp instant messaging/chat app, but our sysadmin + network team maintained a Jabber server plus a bunch of core documentation synchronized on a VPS in totally different infrastructure, just in case, and sure enough there was a day it came in handy.
AWS, for the ultimate backup, relies on a phone call bridge on the public phone network.
Thanks for the correction, that sounds right. I thought I had remembered IRC but wasn't sure.
That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.

Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!

That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching each other (even if it's a competitor phone network).
That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.
Are you asserting that Rogers employees needed documentation to know that Rogers Wireless runs on Rogers systems?
Rogers is perhaps best described as a confederacy of independent acquisitions. In working with their sales team, I have had to tell them where their facilities are, as the sales engineers don't always know about all of the assets that Rogers owns.

There's also the insistence that Rogers employees should use Rogers services. Paying for every Rogers employee to have a Bell cell phone would not sit well with their executives.

The fact that the risk assessments of the changes being made to the router configuration were incorrect also contributed to the outage.

There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.
So sick of billion-dollar companies not hiring that one more guy.
That is perhaps why they are billion dollar companies and why my company is not very successful.
> Identity Center and only put it in us-east-1

Is it possible to have it in multiple regions? Last I checked, it only accepted one region. You needed to remove it first if you wanted to move it.

Security people and ignoring resiliency and failure modes: a tale as old as time
Correct. That does make it a centralized failure mode and everyone is in the same boat on that.

I’m unaware of any common and popular distributed IDAM that is reliable

Not sure if this counts fully as 'distributed' here, but we (Authentik Security) help many companies self-host authentik multi-region, or across private cloud + on-prem, to allow for quick IAM failover and more reliability than IAMaaS.

There's also "identity orchestration" tools like Strata that let you use multiple IdPs in multiple clouds, but then your new weakest link is the orchestration platform.

Disclosure: I work for FusionAuth, a competitor of Authentik.

Curious. Is your solution active-active or active-passive? We've implemented multi-region active-passive CIAM/IAM in our hosted solution[0]. We've found that meets the needs of many of our clients.

I'm only aware of one CIAM solution that seems to have active-active: Ory. And even then I think they shard the user data[1].

0: https://fusionauth.io/docs/get-started/run-in-the-cloud/disa...

1: https://www.ory.com/blog/global-identity-and-access-manageme... is the only doc I've found and it's a bit vague, tbh.

Hey Dan, appreciate the discussion!

Ory’s setup is indeed true multi-region active-active; not just sharded or active-passive failover. Each region runs a full stack capable of handling both read and write operations, with global data consistency and locality guarantees.

We’ll soon publish a case study with a customer that uses this setup that goes deeper into how Ory handles multi-region deployments in production (latency, data residency, and HA patterns). It’ll include some of the technical details missing from that earlier blog post you linked. Keep an eye out!

There are also some details mentioned here: https://www.ory.com/blog/personal-data-storage

> I’m unaware of any common and popular distributed IDAM that is reliable

Other clouds, lmao. Same requirements, not the same mistakes. Source: worked for several, one a direct competitor.

Wow, you really *have* to exercise the region failover to know if it works, eh? And that confidence gets weaker the longer it’s been since the last failover I imagine too. Thanks for sharing what you learned.
You should assume it will not work unless you test it regularly. That's a big part of why having active/active multi-region is attractive, even though it's much more complex.
That most likely wouldn't have even caught this, unless they verified they had no incidental tie-ins with us-east-1.
The last place I worked actively switched traffic over to the backup nodes regularly (at least monthly) to ensure we could do it when necessary.

We learned that lesson by having to do emergency failovers and having some problems. :)

For what it's worth, we were unable to log in with root credentials anyway.

I don't think any method of auth was working for accessing the AWS console.

Sure it was, you just needed to log in to the console via a different regional endpoint. No problems accessing systems from ap-southeast-2 for us during this entire event, just couldn’t access the management planes that are hosted exclusively in us-east-1.
Like the other poster said, you need to use a different region. The default region (of course) sends you to us-east-1.

e.g. https://us-east-2.console.aws.amazon.com/console/home

It's a good reminder, actually, that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and is seamless.
If you don’t regularly restore a backup, you don’t have one.
Too much armor makes you immobile. Will your security org be taken to task for this? This should permanently slow down all of their future initiatives, because it’s clear they have been running “faster than possible” for some time.

Who watches the watchers.

Totally ridiculous that AWS wouldn't by default make it multi-region and warn you heavily that your multi-region service is tied to a single region for identity.

The usability of AWS is so poor.

They don’t charge anything for Identity Center and so it’s not considered an important priority for the revenue counters.
I always find it interesting how many large enterprises have all these DR guidelines but fail to ever test them. Glad to hear that everything came back alright.
People will continue to purchase Multi-AZ and multi-region even though you have proved what a scam it is. If the east region goes down, ALL of Amazon goes down; feel free to change my mind. STOP paying double rates for multi-region.
Sounds like a lot of companies need to update their BCP after this incident.
"If you're able to do your job, InfoSec isn't doing theirs"
