
Build for the person logging on at 3 am when the world’s on fire



Your systems will fail at some point. Not because you’re bad at your job, but because the world has a habit of throwing curveballs at us. And there are only so many we can prepare for.


That’s why, as a design philosophy, building your systems so you can diagnose and recover them when they fail is a great idea. If you’re the person who gets the call at 3 am when a mission-critical system tanks, what do you need to get it back online in the shortest possible time?


I’ve been there. The call came through at 3 am. One unusual contract created a cascade of failures. The system collapsed. The consequences were serious. If we didn’t fix the problem by 7 am, we’d be fined £500,000 for every 15 minutes we weren’t producing correct results.


In this instance, we had instrumentation, telemetry and logs at our fingertips. We identified exactly when the system went down and what it was doing when it crashed. Then we zeroed in on the offending contract.
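
To make that concrete, here is a minimal sketch of the kind of logging that lets you see what the system was doing, and for which contract, at the moment it failed. It isn’t the code we ran; the `traced` helper, the step name and the contract ID are hypothetical names chosen for illustration, and it uses only Python’s standard `logging` module.

```python
import logging
from contextlib import contextmanager

# Timestamps on every line let you pin down exactly when things went wrong.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("contracts")
logger.setLevel(logging.INFO)


@contextmanager
def traced(step, contract_id):
    """Log the start, success or failure of one processing step.

    `step` and `contract_id` are hypothetical labels for illustration.
    """
    logger.info("start step=%s contract=%s", step, contract_id)
    try:
        yield
    except Exception:
        # logger.exception records the full traceback alongside the message.
        logger.exception("failed step=%s contract=%s", step, contract_id)
        raise
    else:
        logger.info("done step=%s contract=%s", step, contract_id)


# Usage: wrap each unit of work so the last "start" line without a
# matching "done" line points at the offending contract.
with traced("valuation", "C-1001"):
    pass  # the real work would go here
```

With something like this in place, the last `start` line without a matching `done` tells you which contract was in flight when the process died.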


What we didn’t have was a mechanism to exclude contracts from processing. So, at 3.30 am, with the clock ticking, our only option was to apply a hotfix to code in production. Scary stuff. We tested as far as we could and got back up and running, avoiding a significant fine.


The next morning, over several large coffees, our main takeaway was the cascading failures. We had to make the system more resilient. Then we focused on strengthening our testing. The contract was unusual but certainly not outside the realms of possibility. Finally, we needed to build the exclusion mechanism, so we had a code-free route to switch off any contracts causing us problems in future.
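
An exclusion mechanism like this can be very simple. The sketch below is one way it might look, not what we actually built: it assumes contracts are dicts with an `id` key and that excluded IDs live in a JSON file someone can edit without a deployment. The file format and the names `load_excluded_ids` and `process_contracts` are all hypothetical. It also catches failures per contract, so one bad contract doesn’t take the rest down with it.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("contracts")


def load_excluded_ids(path):
    """Read excluded contract IDs from a JSON list; no file means no exclusions."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text(encoding="utf-8")))


def process_contracts(contracts, handler, exclusions_path):
    """Run `handler` on each contract, skipping excluded ones.

    Returns three lists of contract IDs: processed, skipped, failed.
    """
    excluded = load_excluded_ids(exclusions_path)
    processed, skipped, failed = [], [], []
    for contract in contracts:
        contract_id = contract["id"]
        if contract_id in excluded:
            logger.warning("skipping excluded contract %s", contract_id)
            skipped.append(contract_id)
            continue
        try:
            handler(contract)
            processed.append(contract_id)
        except Exception:
            # Contain the failure so it cannot cascade to other contracts.
            logger.exception("contract %s failed; continuing", contract_id)
            failed.append(contract_id)
    return processed, skipped, failed
```

At 3 am, switching off a problem contract then means adding its ID to the file and re-running, rather than patching production code.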


Without instrumentation, logs and telemetry, we’d have been flying blind. And in this case, the issue was so obscure I’d be surprised if we’d ever have found it without those logs. Certainly not by that expensive deadline.


If you ever wonder whether you can skip logs, remember this story. They’re not a nice-to-have. It’s impossible to support a platform you can’t observe effectively. Especially when the dreaded call comes through at 3 am.
