Outage Post-Mortem Guide

What a Post-Mortem Is (and Is Not)

A post-mortem is a structured review conducted after a website outage or incident. The goal is to understand what happened, why it happened, and what changes will prevent it from happening again. It is not a blame session. It is not a formality. It is one of the most effective tools for reducing repeat incidents.

The term comes from medicine, where post-mortem examinations determine cause of death. In engineering, the "death" is the incident, and the examination is about understanding the systemic factors that caused it.

Good post-mortems produce concrete action items that actually get implemented. Bad post-mortems produce documents that sit in a shared drive untouched until the same incident happens again three months later.

Why Post-Mortems Matter

Without a post-mortem process, incidents repeat. Your team fixes the immediate problem, everyone goes back to their normal work, and the underlying conditions that caused the outage remain in place. The next time those conditions align, the same type of incident occurs.

Google's Site Reliability Engineering team, among others, treats disciplined post-mortem practice as central to reducing repeat incidents. The post-mortem is where that institutional learning happens.

Post-mortems also build team confidence. When people see that incidents lead to real improvements rather than finger-pointing, they become more willing to surface problems early, report near-misses, and propose systemic fixes. This feedback loop is what separates teams that improve over time from teams that fight the same fires repeatedly.

Blameless Culture

This is the most important concept in incident review, and the one most teams get wrong.

A blameless post-mortem focuses on systems, processes, and conditions rather than individual mistakes. The question is never "who messed up?" The question is "what about our systems allowed this to happen?"

This is not about letting people off the hook. It is about recognizing that individuals make mistakes in the context of the systems they operate within. If an engineer deployed a bad config change that caused an outage, the interesting question is not "why did this person make a mistake?" Humans make mistakes. The interesting questions are: Why did the deployment process allow a bad config to reach production? Why was there no automated validation? Why did the monitoring not catch it immediately?

When people fear punishment, they hide information. They downplay their role in an incident. They avoid reporting near-misses. All of this makes your systems less safe, not more.

Practical rules for blameless post-mortems:

Use "the system" as the subject, not names. "The deployment pipeline did not include a config validation step" rather than "Alex deployed without checking."
Focus on contributing factors, not a single root cause. Real incidents almost always have multiple contributing factors.
Treat every finding as an opportunity to improve the system.
Never use post-mortem findings in performance reviews.

When to Run a Post-Mortem

Not every incident requires a full post-mortem. Running one for every minor blip creates process fatigue and devalues the practice. Here are reasonable thresholds.

Always run a post-mortem when:

Customer-facing downtime exceeded 30 minutes.
Data loss or corruption occurred, regardless of scope.
The incident required manual intervention from more than one team.
The incident was a repeat of a previous incident.
Revenue was directly impacted.

Consider a post-mortem when:

The incident was brief but revealed a systemic vulnerability.
The response process had significant confusion or delays.
A near-miss occurred that could have been a major incident.
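
The thresholds above can be sketched as a small triage function. This is only an illustration: the Incident class and its field names are hypothetical, and your team may draw the lines differently.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    downtime_minutes: float = 0.0
    data_loss: bool = False
    teams_involved: int = 1
    is_repeat: bool = False
    revenue_impacted: bool = False
    revealed_systemic_vulnerability: bool = False
    response_confusion: bool = False
    near_miss: bool = False


def postmortem_decision(incident: Incident) -> str:
    """Return 'required', 'consider', or 'skip' based on the lists above."""
    required = (
        incident.downtime_minutes > 30      # customer-facing downtime
        or incident.data_loss               # any scope counts
        or incident.teams_involved > 1      # more than one team intervened
        or incident.is_repeat
        or incident.revenue_impacted
    )
    if required:
        return "required"

    worth_considering = (
        incident.revealed_systemic_vulnerability
        or incident.response_confusion
        or incident.near_miss
    )
    return "consider" if worth_considering else "skip"


print(postmortem_decision(Incident(downtime_minutes=45)))
print(postmortem_decision(Incident(downtime_minutes=5, near_miss=True)))
```

Encoding the thresholds this way keeps the decision consistent between incidents, but the final call should still rest with the incident lead.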

For guidance on building the monitoring that catches these incidents quickly, see our website maintenance and monitoring guide.

The Post-Mortem Meeting Structure

The meeting should happen within 48 hours of the incident, while memories are fresh. Keep it focused and time-boxed.

Before the Meeting (30 Minutes of Prep)

The incident lead (whoever coordinated the response) should prepare a draft timeline of events before the meeting. This is not the final document. It is a starting point for discussion.

Gather monitoring data, alert logs, chat transcripts, and any other artifacts from the incident. Having this data in the meeting prevents arguments about what happened when.

Invite everyone involved in the detection, response, and resolution. Also invite stakeholders who were affected but not involved in the fix. Their perspective on the customer impact is valuable.

The Meeting (60 Minutes Max)

Timeline review (20 minutes). Walk through the incident chronologically. Start from the first sign of trouble (an alert firing, a customer report, or a team member noticing something) and continue through to full resolution. Let each participant add context from their perspective. The goal is a shared, accurate understanding of what happened.

Contributing factors (15 minutes). Identify every factor that contributed to the incident occurring or being worse than it needed to be. This is broader than "root cause." A server misconfiguration might be the root cause, but contributing factors might include: no automated config validation, monitoring that took 10 minutes to detect the issue, an unclear escalation path that added 20 minutes to the response. For metrics that should inform this analysis, review the incident response metrics guide on Website Uptime Monitor.

What went well (5 minutes). Identify what worked during the response. This matters. It reinforces good practices and gives credit where it is due. Maybe the alerting fired quickly, the on-call person responded within minutes, or the communication to customers was clear and timely.

Action items (20 minutes). For each contributing factor, propose a specific, assignable, time-bound action item. "Improve monitoring" is not an action item. "Add a health check for the payment service endpoint with a 2-minute alert threshold, assigned to Sarah, due by April 30" is an action item.

After the Meeting

The incident lead writes up the formal post-mortem document (template below), circulates it for review by attendees, and publishes it to the team's incident log. Action items go into whatever task tracking system the team uses, with owners and deadlines.

What to Document

The post-mortem document is the artifact that outlasts the meeting. It needs to be clear enough that someone who was not in the room can understand what happened and what changed as a result.

Incident Summary

Two to three sentences covering what happened, the duration, and the customer impact. This is the executive summary for people who will not read the full document.

Timeline

A chronological list of events with timestamps. Include when the issue started, when it was detected, key response actions, and when it was resolved. Be specific about times. "Around 2pm" is less useful than "14:03 UTC."

If your incident response plan was followed, note where it was followed and where the team deviated.
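
Precise timestamps also make it easy to derive detection and resolution times. The sketch below assumes timestamps are recorded as "YYYY-MM-DD HH:MM" in UTC; that format and the function names are assumptions for illustration, not part of any particular tool.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string and mark it as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def timeline_metrics(started: str, detected: str, resolved: str) -> dict:
    """Compute minutes to detect, minutes to resolve, and total duration."""
    start, detect, resolve = (parse_utc(s) for s in (started, detected, resolved))
    if not start <= detect <= resolve:
        raise ValueError("expected started <= detected <= resolved")
    return {
        "minutes_to_detect": (detect - start).total_seconds() / 60,
        "minutes_to_resolve": (resolve - detect).total_seconds() / 60,
        "total_minutes": (resolve - start).total_seconds() / 60,
    }


metrics = timeline_metrics(
    started="2025-04-12 14:03",
    detected="2025-04-12 14:13",
    resolved="2025-04-12 14:51",
)
print(metrics)
```

With the example values, detection took 10 minutes and resolution a further 38, for 48 minutes in total. Vague entries such as "around 2pm" cannot be fed into a calculation like this, which is one practical reason to record exact times.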

Root Cause

A clear explanation of the technical cause of the incident. Write it so a competent engineer outside your team could understand it. Avoid jargon that only makes sense to people who were there.

Contributing Factors

Everything that made the incident more likely to occur, harder to detect, or slower to resolve. This section is usually more valuable than the root cause section because it surfaces systemic issues rather than just the immediate trigger.

Common contributing factors include:

Missing or insufficient monitoring.
Unclear ownership or escalation paths.
Lack of automated testing or validation.
Documentation gaps.
Single points of failure.
Alert fatigue causing slow response.

For an overview of common website failure modes to watch for, see our guide on website downtime causes and prevention.

Impact

Quantify the impact as precisely as possible. Duration of customer-facing downtime. Number of affected users or transactions. Revenue impact if calculable. Support tickets generated. SLA compliance impact.
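
As a worked example of the arithmetic, the sketch below converts downtime into an availability percentage and checks it against an SLA target. The 30-day period, the 99.9% target, and the user counts are assumed values for illustration; substitute the terms of your own SLA.

```python
def availability_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability over the period, as a percentage."""
    period_minutes = period_days * 24 * 60
    return 100.0 * (1 - downtime_minutes / period_minutes)


def sla_breached(downtime_minutes: float, sla_target_percent: float,
                 period_days: int = 30) -> bool:
    """True if the downtime pushes availability below the SLA target."""
    return availability_percent(downtime_minutes, period_days) < sla_target_percent


def percent_affected(affected_users: int, total_users: int) -> float:
    """Share of users affected, as a percentage."""
    return 100.0 * affected_users / total_users


# Assumed figures: 48 minutes of downtime in a 30-day month, 99.9% SLA,
# 1,200 of 48,000 users affected.
print(round(availability_percent(48), 3))
print(sla_breached(48, 99.9))
print(percent_affected(1200, 48000))
```

A 30-day month has 43,200 minutes, so a 99.9% target allows about 43 minutes of downtime. A single 48-minute outage therefore breaches it, which is the kind of concrete statement the Impact section should contain.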

What Went Well

List the things that worked during detection and response. This reinforces good practices and provides a balanced view.

Action Items

Each action item should include: a description of the change, the person responsible, a deadline, and the expected outcome. Number them for easy reference in follow-up discussions.
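
One way to enforce this is to check each action item for the four required parts before the document is published. The sketch below is a minimal illustration: the ActionItem class and its field names are hypothetical, and the year in the example deadline is assumed.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical action item record mirroring the fields listed above."""
    number: int
    description: str
    owner: Optional[str] = None
    deadline: Optional[date] = None
    expected_outcome: str = ""


def missing_fields(item: ActionItem) -> List[str]:
    """Return the names of required parts that are absent from the item."""
    missing = []
    if not item.description.strip():
        missing.append("description")
    if not item.owner:
        missing.append("owner")
    if item.deadline is None:
        missing.append("deadline")
    if not item.expected_outcome.strip():
        missing.append("expected outcome")
    return missing


vague = ActionItem(number=1, description="Improve monitoring")
specific = ActionItem(
    number=2,
    description="Add a health check for the payment service endpoint "
                "with a 2-minute alert threshold",
    owner="Sarah",
    deadline=date(2025, 4, 30),  # year is an assumption
    expected_outcome="Payment outages are detected within 2 minutes",
)
print(missing_fields(vague))
print(missing_fields(specific))
```

A check like this catches missing owners and deadlines, but it cannot judge whether a description is specific enough. That still needs a human reviewer.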

Post-Mortem Template

Here is a template you can copy and adapt for your team.

Incident Title: [Brief descriptive title]

Date: [Date of incident]

Duration: [Total duration of customer-facing impact]

Severity: [Critical / Major / Minor]

Incident Lead: [Name]

Summary: [2-3 sentence overview of the incident and its impact]

Timeline:

[HH:MM UTC] - [Event description]
[HH:MM UTC] - [Event description]
[HH:MM UTC] - [Event description]

Root Cause: [Clear technical explanation]

Contributing Factors:

[Factor and explanation]
[Factor and explanation]
[Factor and explanation]

Impact:

Duration: [X hours, Y minutes]
Users affected: [Number or percentage]
Revenue impact: [Amount or "not calculable"]
Support tickets: [Number]

What Went Well:

[Item]
[Item]

Action Items:

| # | Description | Owner | Deadline | Status |
|---|-------------|-------|----------|--------|
| 1 | [Action item] | [Name] | [Date] | Open |
| 2 | [Action item] | [Name] | [Date] | Open |
| 3 | [Action item] | [Name] | [Date] | Open |

Reviewed by: [Names of attendees]

Distributing Findings

A post-mortem that only the incident responders read is a missed opportunity. Different audiences need different levels of detail.

Internal Engineering Team

Share the full post-mortem document. Engineers benefit from the technical detail, the contributing factors analysis, and the specific action items. Post it in a central location where people can reference it during future incidents.

Broader Organization

Share a summary version: what happened, the customer impact, and what you are doing about it. Skip the technical details. Leadership and non-technical teams need to understand the business impact and the plan for improvement, not the specifics of, say, database connection pool exhaustion.

Customers (When Appropriate)

For significant incidents, consider publishing a public post-incident review. This does not need to include internal process details. Cover what happened, what the impact was, and what you are doing to prevent recurrence. Companies that publish honest post-incident reviews tend to earn more trust than those that stay silent.

Incident Log

Maintain a running log of all post-mortems. Over time, this log reveals patterns: recurring types of incidents, common contributing factors, areas of the system that are fragile. Quarterly reviews of this log often surface systemic improvements that individual post-mortems miss.
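
If the log is kept in a structured form, spotting recurring factors can be partly automated. The sketch below assumes each post-mortem is a dictionary with a "contributing_factors" list of short labels; that structure and the sample entries are hypothetical.

```python
from collections import Counter


def recurring_factors(postmortems, min_count=2):
    """Return (factor, count) pairs seen in at least min_count post-mortems."""
    counts = Counter(
        factor
        for pm in postmortems
        # set() so a factor listed twice in one post-mortem counts once
        for factor in set(pm["contributing_factors"])
    )
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(factor, n) for factor, n in ranked if n >= min_count]


# Hypothetical incident log entries.
incident_log = [
    {"title": "Checkout outage",
     "contributing_factors": ["missing monitoring", "unclear escalation path"]},
    {"title": "Expired certificate",
     "contributing_factors": ["missing monitoring", "documentation gap"]},
    {"title": "Bad config deploy",
     "contributing_factors": ["no automated validation", "missing monitoring",
                              "unclear escalation path"]},
]
print(recurring_factors(incident_log))
```

This only works if factors are labeled consistently across post-mortems, so it helps to agree on a short shared vocabulary such as the common factors listed earlier in this guide.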

Google's Site Reliability Engineering book, freely available at sre.google, provides one of the most comprehensive treatments of blameless post-mortem culture. Their approach has influenced incident management practices across the industry.

Following Through on Action Items

The post-mortem is only as good as the follow-up. Track action items in your team's existing project management tool (Jira, Linear, Asana, whatever you use). Assign clear owners and realistic deadlines.

Review open post-mortem action items weekly in your team standup or planning meeting. If action items consistently slip, that is a signal that your team is overcommitting or under-prioritizing incident prevention work.
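
For the weekly review, a short script can list the items that are past their deadline. This sketch assumes action items have been exported from your tracker as dictionaries with "deadline" and "status" keys and that finished items are marked "Done"; none of this reflects a specific tracker's API.

```python
from datetime import date


def overdue_items(items, today):
    """Return unfinished items whose deadline has passed, oldest first."""
    late = [
        item for item in items
        if item["status"] != "Done" and item["deadline"] < today
    ]
    return sorted(late, key=lambda item: item["deadline"])


# Hypothetical export of post-mortem action items.
action_items = [
    {"number": 1, "owner": "Sarah", "deadline": date(2025, 4, 30), "status": "Open"},
    {"number": 2, "owner": "Alex", "deadline": date(2025, 4, 15), "status": "Open"},
    {"number": 3, "owner": "Sam", "deadline": date(2025, 4, 10), "status": "Done"},
    {"number": 4, "owner": "Sam", "deadline": date(2025, 6, 1), "status": "Open"},
]

for item in overdue_items(action_items, today=date(2025, 5, 5)):
    print(item["number"], item["owner"], item["deadline"].isoformat())
```

Passing today as an argument rather than calling date.today() inside the function keeps the check easy to test and to rerun for past dates.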

Some teams dedicate a fixed percentage of engineering time (10-20%) to reliability work, which includes post-mortem action items. This prevents reliability improvements from being perpetually deprioritized in favor of feature work.

References

Google SRE Book - Postmortem Culture - Google's comprehensive guide to building a blameless post-mortem culture.
Etsy - Blameless Post-Mortems - Etsy's influential approach to blameless incident review.
PagerDuty - Post-Mortem Process - PagerDuty's open-source post-mortem documentation and templates.