Cloud & Infrastructure · Level 3
When it breaks
Mission briefing
Mission 3-C: 2:14am. The site is down. The incident channel is a blur of messages. You got pulled in because the error page users are seeing is one you designed — and right now it's the only thing standing between the company and a totally blank screen.
We're in downtime: the site is unreachable for users. Every minute counts. Right now your error page is doing real work, telling people "we know, and we're on it" instead of showing a scary blank screen.
This is a formal incident — that's the word for an active outage we're coordinating a response to. There's a process: identify, mitigate, recover, then write up what happened. You're part of the response because the user-facing message is yours.
We're reading the logs, the timestamped record of what the system did, including the moments right before it fell over. The logs show the primary database stopped responding. We're triggering failover: switching to a standby copy that takes over when the primary dies.
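If it helps to see the idea in code, here is a minimal sketch of failover in Python. Everything in it is made up for illustration: the `Database` class, the `query_with_failover` function, and the log message are assumptions, not how this company's systems actually work. In a real setup, failover is usually handled by the database platform or an orchestration tool rather than by each query.

```python
from dataclasses import dataclass


@dataclass
class Database:
    """A stand-in for a real database connection (hypothetical)."""
    name: str
    healthy: bool = True

    def query(self, sql: str) -> str:
        # An unhealthy database behaves like one that stopped responding.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is not responding")
        return f"{self.name} answered: {sql}"


def query_with_failover(primary: Database, standby: Database,
                        sql: str, log: list[str]) -> str:
    """Try the primary first; if it fails, record it and use the standby."""
    try:
        return primary.query(sql)
    except ConnectionError as err:
        # This log line is the kind of record responders read during an incident.
        log.append(f"primary failed ({err}); failing over to {standby.name}")
        return standby.query(sql)


if __name__ == "__main__":
    incident_log: list[str] = []
    primary = Database("primary", healthy=False)  # simulate the outage
    standby = Database("standby")
    print(query_with_failover(primary, standby, "SELECT 1", incident_log))
    print(incident_log)
```

The point to take away is the shape of the logic: try the primary, notice that it failed, write that down, and hand the work to the standby.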
Failover kicks in. The site flickers back. My error page did its job for eleven minutes — and now I'm thinking about how to make it calmer and clearer for next time, because there's always a next time.
During the outage, what's the most useful thing your error page can do?