
I once found something important broken at Facebook and created a SEV, but the hardest part was figuring out which team to page. I unfortunately roped in the oncall of a non-responsible team who was making dinner at home, but thankfully he knew which team to reach out to based on the description of the problem.

Would be nice if there was better tooling for going from observed problem to responsible team.


Some FAANGs at least (though they may not cover everything) have a "help something is broken but I don't know what to do" team and/or rotation for incident response, staffed on multiple continents to "follow the sun".

But you need to know they exist. :)

I've worked on several such teams (not at FANGy places, but some household names), variously called just the NOC or SOC (early on in my career, the role was also a kind of on-duty Linux admin/computer generalist), Command Center, and Mission Control. It was great fun a lot of the time but the hours got to be tiresome.

I would be very surprised if any enterprise of significant size and IT complexity didn't have an IT incident response team. I'm biased but I think they are a necessity in complex environments where oncall engineers can't possibly even keep track of all their integrators and integrators' integrators, etc. It also helps to have incident commanders who do that job multiple times a week instead of a few times a decade.

I've never worked at a FAANG, but I've been at a Fortune 20 company for the last 9 years. Is there no system of record for applications?

I can go to a website and type in search terms, URLs and pull up exactly who to contact. Even our generic "help something is broken" group relies on this. There are many names listed so even if the on call person listed is "making dinner", you have their backup, their manager, etc.

I can tag my system as dependent on another and if they have issues I get alerted.
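A minimal sketch of what that kind of system of record might look like: a searchable catalog mapping services and URLs to owning teams, plus dependency tagging so dependents get notified when an owner has issues. All names, fields, and the alerting model here are illustrative assumptions, not any real internal tool.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    name: str
    urls: list[str]
    oncall: str                 # current primary oncall contact
    backups: list[str]          # backup oncall, manager, etc.
    dependents: set[str] = field(default_factory=set)


class ServiceCatalog:
    def __init__(self) -> None:
        self._entries: dict[str, ServiceEntry] = {}

    def register(self, entry: ServiceEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[ServiceEntry]:
        """Find owners by service name or URL substring."""
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.name.lower() or any(term in u.lower() for u in e.urls)
        ]

    def tag_dependency(self, dependent: str, dependency: str) -> None:
        """Mark `dependent` so it gets alerted when `dependency` has issues."""
        self._entries[dependency].dependents.add(dependent)

    def who_to_alert(self, service: str) -> list[str]:
        """Owner's oncall and backups, plus the oncalls of tagged dependents."""
        entry = self._entries[service]
        notified = [entry.oncall, *entry.backups]
        notified += [self._entries[d].oncall for d in entry.dependents]
        return notified
```

Searching by URL is the key affordance described above: an engineer who only knows the broken endpoint can still land on the right contact list.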

I am simplifying quite a bit, but you are expected to know your direct dependencies (and normally will); pagers have embedded escalation rules with primaries and secondaries (roughly as sketched below), etc. The tooling, once you know what to do, is better than anything outside of FAANGs I've seen in terms of integration and reliability.

Escalation teams are usually reserved for the "oh fuck" situations, like "I don't work on this site but I found it broken", or "hey, I think we are going to lose this availability zone soon", or "I am panicking and have no idea how to manage this incident, please help me".

They're a glue mechanism to prevent silos and paralysis during an event, and usually pretty good engineers too.
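As a rough illustration of the escalation rules mentioned above: a chain that pages the primary, waits for an acknowledgement, and then falls through to the secondary and beyond. The timeouts, targets, and the `send_page`/`wait_for_ack` stand-ins are all hypothetical placeholders for whatever paging backend is actually in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EscalationStep:
    target: str          # who to page at this step
    ack_timeout_s: int   # how long to wait for an ack before escalating


def page_with_escalation(
    steps: list[EscalationStep],
    send_page: Callable[[str], None],
    wait_for_ack: Callable[[str, int], bool],
) -> str | None:
    """Walk the escalation chain until someone acknowledges the page."""
    for step in steps:
        send_page(step.target)
        if wait_for_ack(step.target, step.ack_timeout_s):
            return step.target  # acknowledged; stop escalating
    return None  # nobody acked; hand off to the incident/escalation team


# Example policy: primary first, then secondary, then the team's manager.
policy = [
    EscalationStep("service-primary-oncall", ack_timeout_s=300),
    EscalationStep("service-secondary-oncall", ack_timeout_s=300),
    EscalationStep("team-manager", ack_timeout_s=600),
]
```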

That was one of the first things we built at Netflix when I got there. We had a paging schedule tied to every microservice. If you knew what service was broken, you could just "page the service" and their current oncall would get paged.

If you didn't know what it was, you could page the SRE team and we'd diagnose with you.

Sometimes as SREs we would shortcut the process and just know who the right person is with the answer, but at least this way that tribal knowledge was somewhat encoded.
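A rough sketch of the "page the service" idea under stated assumptions: a per-service rotation and a naive weekly rotation rule. The service names, rotation logic, and the print-based paging are made up for illustration and are not Netflix's actual scheduling or paging system.

```python
from datetime import datetime, timezone

# service -> ordered rotation of engineers, one week per person (assumed)
ROTATIONS: dict[str, list[str]] = {
    "playback-api": ["alice", "bob", "carol"],
    "recommendations": ["dave", "erin"],
}


def current_oncall(service: str, now: datetime | None = None) -> str:
    """Resolve whoever is currently oncall for a service's rotation."""
    now = now or datetime.now(timezone.utc)
    rotation = ROTATIONS[service]
    week = now.isocalendar().week
    return rotation[week % len(rotation)]


def page_service(service: str, message: str) -> str:
    """'Page the service' rather than a person: look up the rotation
    and notify whoever is currently oncall for it."""
    target = current_oncall(service)
    print(f"PAGE -> {target}: [{service}] {message}")
    return target
```

The point of encoding the schedule this way is exactly what the comment describes: the tribal knowledge of "who owns this" lives in data rather than in someone's head.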

Yeah, if you know what service is down, it's also trivial at Facebook to track down the oncall for that service. What isn't trivial is when you're starting from a blank page and there are dozens or hundreds of teams that might be responsible.
