18 May 2025 | 3 min read | Debugging

Debugging in Production: Without Panic

Let’s talk about the moment every developer dreads: something blows up in prod while you’re sipping coffee, feeling smug about the morning deploy. We’ve all been there—alert storms, Slack pinging like it’s New Year’s Eve, and that creeping sense of doom.

But here’s the good news: production incidents don’t have to send you spiraling. Today I’ll share how our team learned to debug in production—calmly, methodically, and (mostly) without wrecking our blood pressure.

Example Incident

Picture this hypothetical example: a Friday afternoon, traffic spikes for a seasonal promo, and our brand-new caching layer decides it’s had enough. Orders start timing out. Customers refresh. More timeouts. You get the idea.

What did we do? Exactly what not to do: SSH into random pods, tail logs like headless chickens, and poke the database hoping for divine insight. We fixed it eventually… but we knew we needed a better playbook.

Build Observability Before You Need It

After that fiasco, we invested in three pillars:

Pillar	What We Added	Why It Helped
Structured Logging	JSON logs with trace IDs	One search showed the entire request path.
Metrics & Dashboards	Datadog	We saw the exact spike in latency in seconds.
Distributed Tracing	OpenTelemetry	Followed a failing request across five services.

Lesson: instrument first, debug later. If prod is a black box, you’ll waste precious minutes (or hours) guessing.
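
To make the structured-logging row concrete, here is a minimal sketch using only Python's standard library. It is not our production setup: the names trace_id_var, JsonFormatter, build_logger, and handle_request are made up for illustration, and a real service would take the trace ID from an incoming request header or from its tracing library.

```python
import contextvars
import json
import logging
import uuid

# Holds the trace ID of the request currently being handled (hypothetical name).
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace ID if there is one, otherwise start a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        # Stand-ins for real work; every line logged here carries the same trace_id.
        logger.info("order lookup started")
        logger.info("order lookup finished")
    finally:
        trace_id_var.reset(token)
    return trace_id
```

Calling handle_request(build_logger("orders")) writes two JSON lines that share one trace_id, which is what lets a single search pull up the whole request path.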

With Datadog you can also create SLOs and monitors and send alerts to your Slack channel or to Opsgenie, so the on-call person is notified directly. This kind of proactive monitoring reduces the risk that customers notice a problem before you do.
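
If SLOs are new to you, the arithmetic behind an error budget is small enough to sketch. This is generic SLO math, not Datadog's API; the function names and the 25% alert threshold are assumptions picked for illustration.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the unspent fraction of the error budget (negative once overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests == 0:
        # No traffic yet, so nothing has been spent.
        return 1.0
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def should_alert(remaining, threshold=0.25):
    """Alert once less than `threshold` of the budget is left (assumed policy)."""
    return remaining < threshold
```

With a 99.9% target and one million requests, the budget is roughly 1,000 failed requests; 250 failures leave about three quarters of it, and should_alert stays quiet until less than a quarter remains.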

Keep a “Panic Playbook”

We wrote a tiny Markdown doc called PANIC_MODE.md and pinned it in Slack. It contains:

Alert Triage Checklist – Is it customer-facing? Can we roll back?
Quick Mute Commands – Silence noisy alerts so you can think.
Runbooks – Step-by-step fixes for all the known recurring issues.
Who to Wake Up – A rotation list, so you’re not “that person” texting everyone at 2 a.m. Better still, configure something like Opsgenie to alert the on-call person and not the whole team.

Having a playbook sounds obvious, but the moment stress hits, a checklist is gold.
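
If you don't have a paging tool yet, even the rotation list can be a few lines of code instead of a table someone has to read at 2 a.m. A rough sketch, assuming a weekly rotation; on_call_for and the example names are hypothetical.

```python
from datetime import date


def on_call_for(day, rotation, rotation_start):
    """Return who is on call on `day`, rotating weekly from `rotation_start`."""
    if not rotation:
        raise ValueError("rotation must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Example: three engineers, rotation starting on Monday 5 May 2025.
rotation = ["ana", "ben", "chi"]
print(on_call_for(date(2025, 5, 18), rotation, date(2025, 5, 5)))
```

The example prints whoever holds the second week of the rotation, since 18 May falls 13 days after the start.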

Roll Forward, Not (Always) Backward

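Hot take: rollbacks are overrated if the root cause is config or data, because redeploying old code fixes neither a bad setting nor bad records. We lean on feature flags and config toggles instead, so the risky path can be switched off, often without a new deploy. If you haven't yet read the article on why you should embrace feature flags, I highly recommend it.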
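
Here is roughly what that looks like for the caching-layer incident from earlier. Treat it as a sketch under assumptions: FlagStore is an in-memory stand-in for whatever flag service you use, and the flag name orders_cache and the fetch_order function are invented for the example.

```python
class FlagStore:
    """In-memory flag store; a real one would read from a flag or config service."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name, default=False):
        return self._flags.get(name, default)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def fetch_order(order_id, flags, cache, database):
    """Read through the cache only while the flag is on; otherwise hit the database."""
    if flags.is_enabled("orders_cache"):
        if order_id in cache:
            return cache[order_id]
        order = database[order_id]
        cache[order_id] = order
        return order
    # Flag off: bypass the cache entirely, no redeploy needed.
    return database[order_id]
```

During an incident, flags.set("orders_cache", False) takes the cache out of the request path while the code that was deployed stays exactly where it is.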

Blameless (but Ruthless) Post-Mortems

When the dust settles, we run a 15-20 minute retro:

Timeline – Facts only, minute-by-minute.
Root cause(s) – Usually plural.
Action items – Always tracked, never hand-waved.
“Could this happen again?” – If the answer is yes, we’re not done.

No finger-pointing—just fixing the system so the same surprise won’t bite us twice.

Tiny Habits That Keep You Calm

Quick wins we adopted:

Every new feature behind a feature flag
Proactive monitoring - SLOs and monitors for core features first, then some level of observability shipped with every new feature
Auto-tag every deploy with the Git SHA or another identifier; in our case, a prefix attached to the date and time the release candidate was created
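
The deploy-tagging habit is easy to automate. Below is one possible shape for it, not our exact script: the function name build_deploy_tag, the rc prefix, and the timestamp layout are assumptions, and the Git SHA is passed in rather than looked up.

```python
from datetime import datetime, timezone


def build_deploy_tag(prefix, created_at, git_sha=None):
    """Join a prefix, the release candidate's UTC timestamp, and an optional short SHA."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    stamp = created_at.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    parts = [prefix, stamp]
    if git_sha:
        # Seven characters is the conventional short-SHA length.
        parts.append(git_sha[:7])
    return "-".join(parts)


# Tag for a release candidate created right now.
print(build_deploy_tag("rc", datetime.now(timezone.utc)))
```

Attach the resulting tag to logs, metrics, and traces, and "which deploy broke this?" becomes a filter instead of an investigation.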

Final Thoughts

Debugging in production will never be a spa day, but it doesn’t have to be chaos. Invest in observability, keep a playbook, and practice drills like you’re the ops version of a fire brigade. The next time prod misbehaves, you’ll reach for your tools instead of your stress ball.

Have you survived a memorable prod incident? Hit reply or drop a comment—I’d love to hear how you kept your cool (or didn’t 😉).

Until next time,
Keep Learning. Keep Shipping.