The hardest one I've debugged took a few months to reproduce, and would only show up on hardware that only one person on the team had.
One of the interesting things about working on a very mature product is that bugs tend to be very rare, but those rare ones which do appear are also extremely difficult to debug. The 2-hour, 2-day, and 2-week bugs have long been debugged out already.
The bug was actually quite funny in a way: it was in the code displaying the internal temperature of the electronics box of some industrial equipment. The string conversion was treating the temperature variable as an unsigned int when it was in fact signed. It took a brave field technician in Finland in winter, inspecting a unit in an unheated space, to even discover this particular bug, because the units' internal temperatures were usually about 20C above ambient.
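For anyone who hasn't been bitten by this, here is a minimal Python sketch of that kind of signed/unsigned mix-up. The 16-bit width, the function names, and the raw-register framing are all assumptions made for illustration; the comment doesn't say what the actual firmware looked like.

```python
def format_temp_buggy(raw16: int) -> str:
    # Hypothetical bug: the 16-bit pattern is printed as if it were unsigned.
    return f"{raw16 & 0xFFFF} C"


def format_temp_fixed(raw16: int) -> str:
    # Interpret the same 16 bits as a two's complement signed value.
    value = raw16 & 0xFFFF
    if value >= 0x8000:
        value -= 0x10000
    return f"{value} C"


# 0xFFFB is the 16-bit two's complement pattern for -5.
print(format_temp_buggy(0xFFFB))
print(format_temp_fixed(0xFFFB))
```

Both functions agree for every non-negative reading, which is why a bug like this can sit unnoticed until a unit actually gets below zero inside.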
Prior to this year, they could only handle 0-127 degrees for the water temperature. Which used to be sensible, but there were some issues with pressurised water starting to be delivered to houses resulting in negative temperatures being reported, like -125C, which immediately has the water switch off to prevent icing problems.
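One way to arrive at a number like -125, assuming (my guess, it isn't stated above) that the reading ends up in a signed 8-bit field: a value of 131 doesn't fit in 0-127 and wraps around. A small Python sketch, with a hypothetical helper name:

```python
def read_as_int8(value: int) -> int:
    # Store an unsigned 0-255 value in one byte, then reinterpret it as signed.
    raw = value.to_bytes(1, "big", signed=False)
    return int.from_bytes(raw, "big", signed=True)


print(read_as_int8(127))
print(read_as_int8(131))
```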
The software side also switched from COBOL to Ada. So that's kewl.
Regarding "exhausting 2-day brute-force grind": is/was this just how you like to get things done, or was there external pressure of the "don't work on anything else" sort? I've never worked at a large company, and lots of descriptions of the way things get done are pretty foreign to me :). I am also used to being able to say "this isn't getting figured out today; probably going to be best if I work on something else for a bit, and sleep on it, too".
Our team also had a very grindy culture, so "I'm going to put in extra hours focusing exclusively on our top crash" was a pretty normalized behavior. After I left that team (and Google), most of my future teams have been more forgiving on pace for non-outages.
Math.abs(Integer.MIN_VALUE) in Java very seriously returns -2147483648, as there is no int for 2147483648.
In C#, Math.Abs(int.MinValue) throws an OverflowException ("Negating the minimum value of a twos complement number is invalid.")
  a = torch.tensor(-2**31, dtype=torch.int32)
  assert a == a.abs()
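Plain Python ints don't overflow, so to see why this happens you have to emulate the 32-bit wraparound by hand. A rough sketch, where wrap_int32, abs_wrapping, and abs_checked are names I made up for illustration, not any library's API:

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    # Reduce an arbitrary Python int to the 32-bit two's complement range.
    return (value + 2**31) % 2**32 - 2**31


def abs_wrapping(x: int) -> int:
    # Mimics the silent behaviour: negate, then wrap back into 32 bits.
    return wrap_int32(-x if x < 0 else x)


def abs_checked(x: int) -> int:
    # Mimics the throwing behaviour: refuse the one value with no positive twin.
    if x == INT32_MIN:
        raise OverflowError("abs(INT32_MIN) does not fit in 32 bits")
    return abs(x)


print(abs_wrapping(-5))
print(abs_wrapping(INT32_MIN))
```

The asymmetry is the whole story: the range runs from -2147483648 to 2147483647, so exactly one value has nowhere to go when negated.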
Turns out IE8 doesn't define console until the devtools are open. That caused me to pull a few hairs out.
My favourites are bugs that not only don't appear in the debugger, but also stop reproducing on normal settings after I've taken a closer look in the debugger (only to come back later at a random time). Feels like chasing ghosts.
A favourite of mine was a bug (specifically, a stack corruption) that I only managed to see under instrumentation. After a lot of debugging, it turned out that the bug was in the instrumentation software itself, which generated invalid assembly under certain conditions (calling one of its own functions with 5 parameters even though it takes only 4). Resolved by upgrading to their latest version.
https://www.hackerneue.com/item?id=37859771
Point being that the difficulty of a fix can come from many possible places.
The more abundant the undefined (mis)behavior, the more you're going to be tearing your hair out.
Almost the kind of frustration where you're supposed to have a logic-based system, and it rears its ugly head and defies logic anyway :\
Part of it was difficulty of pinpointing the actual issue - fullness of drive vs throughput of writes.
A lot of it was unfortunately organizational politics such that the system spanned two teams with different reporting lines that didn't cooperate well / had poor testing practices.
The hardest bugs in my experience are those where your only source of vital information is a third party who is straight-up lying to you.
Though abs() returning negative numbers is hilarious. “You had one job…”
To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.
I’m not just talking about concurrency issues either…
The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.
2 days is cute though.