On-call doesn’t scale with cleverness; it scales with boring discipline. After a decade of pagers, three continents, and one memorable incident involving a misbehaving cron job in Ismailia, this is the shortlist I actually hand to new rotation-mates. It fits on one page on purpose — if it doesn’t, you’re writing a manual nobody will read at 03:00.
The first rule is: nobody’s reading
At 03:00, you’re not reading. You’re pattern-matching. Runbooks that assume you’ll calmly absorb a wall of text are writing fiction. The pages I’ve written and kept have three things: a one-line symptom, the exact command to run, and the exact thing to check next. The paragraphs-of-explanation version gets archived within six months because nobody opens it during an actual incident.
If your runbook has a “Background” section, that section is for future you, not on-call you. Put it at the bottom.
The second rule is: the first action is “tell people”
Before you touch anything — before you run the rollback, before you scale the pool, before you kick a node — post in the incident channel. Two lines: what you’re seeing and what you’re about to do. That’s it.
This feels like friction. It is friction. It’s also the cheapest friction you will ever introduce into your on-call. It prevents three-engineer diagnostics where everybody’s running contradictory commands and the database has been restarted twice in four minutes. Ask me how I know.
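If the two-line post has a fixed shape, you'll actually write it at 03:00. A minimal sketch of that shape (the wording and field names are mine, not a standard; wire the string to whatever chat tool you use):

```python
def incident_preamble(seeing: str, doing: str) -> str:
    """Format the 'tell people first' post: line 1 is what you're
    seeing, line 2 is what you're about to do. Nothing else."""
    return f"Seeing: {seeing}\nAbout to: {doing}"

# Post this to the incident channel before touching anything.
msg = incident_preamble(
    "p99 latency 4x on checkout since 02:47",
    "roll back deploy 2024-06-12-3 in us-east-1",
)
```

Two lines is the budget on purpose: anything longer and you'll start diagnosing in the compose box instead of in the system.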
Ranked list of things to check first
The order matters. If your dashboard doesn’t make these five things two-clicks-deep or less, that’s a pre-incident problem you’re ignoring.
- Did we deploy in the last hour? 60% of your pages are this, and you know it. Rollback, then debug. Debug-then-rollback is how you end up with a three-hour outage and an extremely polite Slack thread about whose change it was.
- Is the database fine? CPU, connection pool saturation, replication lag. The database is almost always the slowest thing in your request path, and it’s the thing that takes longest to recover when you hurt it.
- Is it just one region / AZ / partition? If the thing looks broken but is only broken in eu-central-1, the fix is probably at the load-balancer layer, not in your code.
- Did an upstream API change? Providers push breaking changes on weekdays at 10:00 and pretend it was in the changelog.
- Did the traffic shape change? Someone’s integration went live and now you’re taking 40× the requests. Check the source IP distribution before you start rewriting a service.
If you don’t find the cause in those five, slow down. The next step is always “read the logs, slowly, from the edge inward” — not “guess and restart things.”
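The ranked order is the whole point, so it's worth making it mechanical rather than vibes-based. A sketch, assuming each check is a predicate you can already answer from your dashboards (the predicates here are placeholders, not real integrations):

```python
from typing import Callable, Optional

def first_suspect(checks: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Walk the ranked checks in order and return the first one that fires.

    Returns None if nothing fires — which means: slow down and read
    the logs from the edge inward, don't guess and restart things.
    """
    for label, fired in checks:
        if fired():
            return label
    return None

# Placeholder predicates — wire these to your deploy log, DB metrics,
# per-region health checks, upstream changelogs, and traffic stats.
checks = [
    ("deploy in the last hour",   lambda: True),
    ("database unhealthy",        lambda: False),
    ("single region/AZ degraded", lambda: False),
    ("upstream API changed",      lambda: False),
    ("traffic shape changed",     lambda: False),
]
```

Encoding the list also forces the team to argue about the order once, in daylight, instead of per-incident at 03:00.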
Heroics are a smell
The first time I did a 14-hour incident solo and everyone clapped, I thought that was the job. The tenth time I did one I realised I’d been carrying a system that should have paged more people, sooner.
Signs you are being a hero instead of an engineer:
- You’re the only one who understands the runbook.
- The dashboards you use don’t exist anywhere except in your bookmarks.
- You know the fix by muscle memory — and have never written it down.
- You’ve decided not to escalate because “it’s faster if I just do it.”
Every one of those is a post-incident action item. “Document the runbook,” “move the dashboard to the shared folder,” “add a section to the training deck.” Your heroics are tech debt. Pay it down before your rotation becomes someone else’s inherited trauma.
The pager is not a performance review
If you take one thing from this post: the pager fires because systems fail. Systems fail because engineering is hard. You are not slower, worse, or less senior because the pager woke you up. You’re the person who showed up.
The anti-pattern here is post-incident reviews that ask “who should have caught this?” The answer is always “the system” — the CI, the test, the alarm, the guardrail, the design review that didn’t ask the right question. The “who” question is shame theater and it makes the next incident harder because people will delay paging to avoid being Named.
Specific habits that keep me sane
These are the boring ones that compound:
- I keep a personal pager log. Time, page, what I checked first, what the actual cause was, what I would’ve checked if I’d known. Five lines per incident. After a year, patterns emerge that the formal post-mortem process misses.
- I silence my pager with deliberate confirmation, not reflex. If the alert is wrong, I fix the alert that night. If I silence it and forget, it’ll page me again at 04:00 and I will deserve it.
- I treat “paged but not an incident” as a bug. False pages are a priority-0 operational issue, not a tax you quietly absorb. If your team normalizes waking up for a non-event, your rotation is slowly unravelling.
- I do the handoff in writing. Always. “Nothing to report” is a sentence. “I ignored the 03:00 page; it’ll probably self-resolve” is a handoff; “goodnight” is an abandonment.
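The five-line pager log only survives if appending an entry is one function call. A sketch of what I mean (the file path and field names are my own convention, not a standard):

```python
import datetime
from pathlib import Path

# Hypothetical location — keep it wherever you'll actually reread it.
LOG = Path("pager-log.md")

def log_page(page: str, checked_first: str, actual_cause: str,
             wish_id_checked: str) -> str:
    """Append one five-line entry: time, page, first check, cause, hindsight."""
    entry = "\n".join([
        f"## {datetime.datetime.now().isoformat(timespec='minutes')}",
        f"- page: {page}",
        f"- checked first: {checked_first}",
        f"- actual cause: {actual_cause}",
        f"- would have checked: {wish_id_checked}",
    ]) + "\n\n"
    with LOG.open("a") as f:
        f.write(entry)
    return entry
```

After a year of these, grepping for “would have checked” is a post-mortem the formal process never runs.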
The runbook template that actually gets used
## Symptom
<one sentence, the thing the pager said>
## Immediate action
<the command, or the dashboard URL, or the "escalate to X">
## Verification
<how you know the thing worked>
## Next steps if it didn't work
<three bullets max, ordered>
## Background
<everything else — the why, the history, the links>
If a runbook can’t fit the top four sections on one screen, it’s two runbooks pretending to be one. Split it.
And finally: sleep
You can’t engineer your way out of sleep deprivation. If the rotation is waking you up more than once a quarter, that’s an engineering problem — not a personal resilience problem — and your job is to fix it by making the system quieter, not by getting tougher.
I run a local-first agent control plane (Fulcrum) partly because I got tired of watching teams reinvent the same incident-response habits I wrote down a decade ago. Pagers aren’t the enemy. The enemy is the organisation that treats a noisy pager as a personality trait of the people holding it.