Incident Response Plan Template for Website Downtime

Build a complete incident response plan for website outages. Includes roles, severity levels, response steps, and communication templates.

Last updated: 2026-02-17

Why You Need an Incident Response Plan

When your website goes down at 2am, you do not want your team scrambling to figure out who does what. You want a plan that has already been tested, with clear roles, defined escalation paths, and communication templates ready to send.

Companies without an incident response plan take 2-3x longer to resolve outages. The chaos of an unplanned response causes confusion, duplicated effort, and missed steps that extend downtime and increase costs.

This guide gives you a complete, ready-to-adapt incident response plan for website and web service outages.

The Core Components of an Incident Response Plan

Every effective incident response plan needs five elements:

Defined Roles

Who is responsible for what during an incident. No ambiguity, no overlap.

Severity Classification

A consistent framework for evaluating how bad an incident is and what response it triggers.

Response Procedures

Step-by-step actions for each severity level, from detection to resolution.

Communication Plan

Who gets told what, when, and through which channels. Internal and external.

Post-Mortem Process

A structured way to learn from every incident and prevent recurrence.

Incident Response Roles

Incident Commander (IC)

The Incident Commander owns the incident from detection to resolution. This person coordinates the response, makes decisions when the team disagrees, and ensures the process is followed.

Responsibilities:

  • Declare the incident and assign severity
  • Assemble the response team
  • Coordinate investigation and remediation
  • Make escalation decisions
  • Authorize changes to production systems
  • Declare the incident resolved

The IC does not need to be the most technical person. They need to be organized, decisive, and calm under pressure.

Technical Lead

The Technical Lead drives the actual investigation and fix. This is your most capable engineer for the affected system.

Responsibilities:

  • Diagnose the root cause
  • Implement the fix or workaround
  • Validate the fix is working
  • Document technical details for the post-mortem
  • Advise the IC on estimated resolution time

Communications Lead

The Communications Lead handles all stakeholder communication so the technical team can focus on fixing the problem.

Responsibilities:

  • Update the status page
  • Send internal notifications to leadership and affected teams
  • Draft and send customer communications
  • Monitor social media and support channels for impact reports
  • Provide regular updates at defined intervals

For small teams, one person may fill multiple roles. That is fine. What matters is that the responsibilities are defined and assigned before an incident happens, not during one.

Severity Levels

Define severity levels based on user impact, not technical complexity. A simple DNS change that takes down your entire site is higher severity than a complex bug that affects one page.

Severity        | Definition                                                         | Response Time | Update Frequency
SEV-1: Critical | Complete site outage or data breach affecting all users            | 5 minutes     | Every 15 minutes
SEV-2: Major    | Significant functionality broken, large portion of users affected | 15 minutes    | Every 30 minutes
SEV-3: Minor    | Partial functionality degraded, subset of users affected           | 1 hour        | Every 2 hours
SEV-4: Low      | Minor issue, minimal user impact, cosmetic or non-critical         | 4 hours       | Daily until resolved

Severity Decision Framework

When classifying an incident, ask these questions in order (a code sketch of this decision tree follows the list):

  1. Is the site completely unreachable? If yes, SEV-1.
  2. Can users complete their primary task? (purchase, sign up, log in) If not, SEV-1 or SEV-2.
  3. How many users are affected? More than 50% = SEV-2. Less than 50% = SEV-3.
  4. Is there a workaround? If users can accomplish their goal another way, consider lowering severity by one level.
  5. Is the issue getting worse? Escalate if the impact is spreading.
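
This decision tree maps directly to code. Below is a minimal sketch in Python; the boolean inputs, and the rule that a blocked primary task is SEV-1 only when more than half of users are affected, are assumptions layered on the framework above rather than part of it.

def classify_severity(site_unreachable: bool,
                      primary_task_blocked: bool,
                      affected_fraction: float,
                      workaround_exists: bool,
                      impact_spreading: bool) -> int:
    """Return a severity level (1 = most severe) from the decision framework."""
    if site_unreachable:
        return 1
    if primary_task_blocked:
        # Assumption: a blocked primary task is SEV-1 when widespread, else SEV-2.
        severity = 1 if affected_fraction > 0.5 else 2
    elif affected_fraction > 0.5:
        severity = 2
    else:
        severity = 3
    if workaround_exists:
        severity = min(severity + 1, 4)  # a workaround lowers severity one level
    if impact_spreading:
        severity = max(severity - 1, 1)  # escalate while the impact is spreading
    return severity

# Example: login is broken for ~30% of users and a workaround exists.
print(classify_severity(False, True, 0.3, True, False))  # -> 3 (SEV-3)

Encoding the framework this way also makes it testable: you can assert that known past incidents classify the way your team actually handled them.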

Response Procedures

Phase 1: Detection and Triage (0-5 minutes)

1. Alert Received - The monitoring system detects an issue and sends an alert through configured channels (email, SMS, Slack, PagerDuty).

2. Acknowledge the Alert - The on-call responder acknowledges within 5 minutes. If unacknowledged, the alert escalates to the backup on-call.

3. Initial Assessment - Verify the issue is real, not a false positive: check monitoring dashboards, try accessing the site manually, and check recent deployments (a minimal verification script follows these steps).

4. Classify Severity - Using the severity framework, assign an initial severity level. This can be adjusted later as more information becomes available.

5. Declare the Incident - If SEV-1 or SEV-2, formally declare an incident: open an incident channel, page the Incident Commander, and begin the response process.
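
The manual "is it really down?" check is easy to script so the on-call responder gets an answer in seconds. A minimal sketch using only the Python standard library; the URL is a placeholder:

import urllib.error
import urllib.request

def check_site(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL once and summarize the result for triage."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"UP: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        return f"DEGRADED: HTTP {exc.code}"  # the server answered, but with an error
    except (urllib.error.URLError, TimeoutError) as exc:
        return f"DOWN: {exc}"  # DNS failure, connection refused, or timeout

print(check_site("https://example.com"))  # placeholder URL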


Phase 2: Investigation and Diagnosis (5-30 minutes)

1. Assemble the Response Team - The IC pages the Technical Lead and Communications Lead. For SEV-1, all relevant engineers join the incident channel.

2. Gather Context - Review monitoring data: when did the issue start? What changed recently? Are there correlated alerts? Check deployment logs, infrastructure changes, and third-party status pages.

3. Check Recent Changes - The most common cause of outages is a recent change. Check deployments, configuration changes, DNS updates, and certificate renewals from the last 24 hours (see the sketch after this list).

4. Identify the Root Cause - Narrow down the failing component. Is it the application, the database, DNS, SSL, a third-party service, or infrastructure?

5. Communicate Status - The Communications Lead posts the first public update to the status page and sends internal notifications.
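
If your deploys are git commits, surfacing the last 24 hours of changes is one command. A sketch, assuming deploys land on an origin/main branch; the repository path is a placeholder:

import subprocess

def recent_changes(repo_path: str, hours: int = 24) -> str:
    """List commits on the production branch within the lookback window."""
    result = subprocess.run(
        ["git", "log", f"--since={hours} hours ago", "--oneline", "origin/main"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout

# print(recent_changes("/srv/app"))  # placeholder repository path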

Phase 3: Remediation (15 minutes - several hours)

1. Decide on Approach - Choose between a full fix and a temporary workaround. When downtime is ongoing, a workaround that restores service quickly is almost always better than a perfect fix that takes hours.

2. Implement the Fix - The Technical Lead implements the fix or workaround. For SEV-1, a second engineer should review changes before they are applied to production.

3. Validate the Fix - Verify the fix is working from multiple perspectives: monitoring tools, manual testing, different geographic locations, and different browsers and devices (a validation sketch follows these steps).

4. Monitor for Recurrence - Watch closely for 30-60 minutes after the fix to ensure the issue does not return.

5. Declare Resolution - The IC declares the incident resolved. The Communications Lead sends final updates to all stakeholders.
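
Part of the validation can be scripted so "it works for me" is backed by evidence. A minimal sketch that checks endpoints covering the primary user journeys; the URLs are illustrative:

import urllib.request

# Endpoints that exercise the primary user journeys (illustrative paths).
CHECKS = [
    "https://example.com/",
    "https://example.com/login",
    "https://example.com/api/health",
]

def validate_fix() -> bool:
    """Return True only if every endpoint responds with HTTP 200."""
    all_ok = True
    for url in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                result = resp.status
        except Exception as exc:  # DNS, TLS, timeout, or HTTP error
            result = exc
        ok = result == 200
        all_ok = all_ok and ok
        print(f"{'PASS' if ok else 'FAIL'} {url} -> {result}")
    return all_ok

if __name__ == "__main__":
    raise SystemExit(0 if validate_fix() else 1)

Run it from more than one network or region if you can; an outage scoped to a single provider or region will not show up from one vantage point.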

Phase 4: Post-Mortem (Within 48 hours)

1. Write the Post-Mortem Report - Document what happened, when, the impact, the timeline, the root cause, and what was done to fix it.

2. Hold the Post-Mortem Meeting - Run a blameless review with all involved parties. Focus on systems and processes, not individuals.

3. Identify Action Items - Define concrete steps to prevent recurrence. Assign an owner and a deadline to each action item.

4. Track Action Items to Completion - Add follow-up items to your project management system. Review progress weekly until all items are complete.

Communication Templates

Status Page: Investigating

[Service Name] - Investigating Reports of [Issue Description]

We are currently investigating reports of [brief description of the issue]. Our team is actively working on this. We will provide an update within [timeframe based on severity level].

Impact: [Description of what users are experiencing]

Started: [Time in UTC]

Status Page: Identified

[Service Name] - Issue Identified

We have identified the cause of [brief description]. [One sentence on what the cause is, if appropriate to share]. Our team is implementing a fix now. We expect to have this resolved within [estimated time].

Impact: [Updated description of what users are experiencing]

Status Page: Resolved

[Service Name] - Resolved

The issue affecting [brief description] has been resolved. All systems are operating normally.

Duration: [Start time] to [End time] ([total duration])

Root cause: [Brief, non-technical explanation]

We apologize for any inconvenience. We are implementing measures to prevent this from recurring.

Internal Notification: SEV-1

INCIDENT: SEV-1 - [Brief Description]

Status: Active
IC: [Name]
Started: [Time]
Impact: [What is affected and how many users]
Incident Channel: [Link]

We are actively investigating. Updates will be provided every 15 minutes.

Never share technical details about security incidents in public communications. For security-related outages, keep public updates focused on impact and resolution status only.

Post-Mortem Template

Use this structure for every post-mortem document:

1. Summary - One paragraph describing what happened, when, and the impact.

2. Timeline - Chronological list of events from first detection to resolution. Include timestamps.

3. Root Cause - Technical explanation of what caused the incident.

4. Impact - Quantified: duration, number of users affected, revenue lost, SLA impact.

5. What Went Well - Things that worked during the response.

6. What Went Wrong - Things that did not work or made the situation worse.

7. Action Items - Specific, assigned, time-bound tasks to prevent recurrence.

Integrating Monitoring into Your Response Plan

Your incident response plan is only as fast as your detection. Manual detection (waiting for customer reports) typically adds 30-60 minutes to incident duration. Automated monitoring reduces detection to seconds.

Multi-Channel Alerting

Route alerts through email, SMS, Slack, and webhooks. Different severity levels can trigger different channels. SEV-1 should page someone immediately.

Context-Rich Alerts

Good alerts include what failed, when it started, and what might have changed. This gives the on-call responder a head start on diagnosis.
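
The two ideas above, severity-based routing and context-rich payloads, fit in a few lines. A sketch in Python; the routing policy, field names, and print-based delivery are illustrative stand-ins for real integrations:

import datetime
from dataclasses import dataclass, field

# Which channels each severity fans out to (illustrative policy).
ROUTES = {
    1: ["pager", "sms", "slack", "email"],  # SEV-1 pages someone immediately
    2: ["sms", "slack", "email"],
    3: ["slack", "email"],
    4: ["email"],
}

@dataclass
class Alert:
    """A context-rich alert: what failed, when it started, and what changed."""
    check_name: str
    severity: int
    started_at: datetime.datetime
    detail: str
    recent_changes: list[str] = field(default_factory=list)

def route(alert: Alert) -> None:
    for channel in ROUTES[alert.severity]:
        # Replace print with real integrations (SMTP, Slack webhook, paging API).
        print(f"[{channel}] SEV-{alert.severity} {alert.check_name}: "
              f"{alert.detail} (since {alert.started_at:%H:%M} UTC); "
              f"recent changes: {alert.recent_changes}")

route(Alert("homepage", 1, datetime.datetime.now(datetime.timezone.utc),
            "HTTP 503 from 3 regions", ["deploy 4f2c1 at 01:47"]))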

Escalation Rules

If the primary on-call does not acknowledge within 5 minutes, automatically escalate to the backup. If the backup does not respond, escalate to the engineering manager.
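
The escalation chain is a small loop. A sketch, where is_acknowledged is a placeholder hook you would wire to your alerting tool's acknowledgement state:

import time

ESCALATION_CHAIN = ["primary-oncall", "backup-oncall", "engineering-manager"]
ACK_TIMEOUT_SECONDS = 5 * 60

def page(target: str) -> None:
    print(f"Paging {target}...")  # replace with a real paging integration

def is_acknowledged() -> bool:
    return False  # placeholder: poll your alerting tool's ack state here

def escalate(poll_interval: float = 15.0) -> None:
    """Page each responder in turn, moving up the chain after each ack timeout."""
    for target in ESCALATION_CHAIN:
        page(target)
        deadline = time.monotonic() + ACK_TIMEOUT_SECONDS
        while time.monotonic() < deadline:
            if is_acknowledged():
                return  # someone has the incident; stop escalating
            time.sleep(poll_interval)
    print("Nobody acknowledged within the timeout; alert the whole team.")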

Status Page Integration

Automatically update your status page when monitoring detects an issue. This reduces the communications burden during active incidents.
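
Most status page providers expose an HTTP API for this; the endpoint, fields, and token below are placeholders, not any specific provider's API:

import json
import urllib.request

def post_status_update(title: str, status: str, body: str) -> None:
    """Create or update a status page incident (hypothetical API shape)."""
    payload = json.dumps({"title": title, "status": status, "body": body}).encode()
    request = urllib.request.Request(
        "https://status.example.com/api/incidents",  # placeholder endpoint
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer <token>",  # placeholder credential
        },
        method="POST",
    )
    urllib.request.urlopen(request, timeout=10)

# Example (placeholder endpoint, so this only works against your own API):
# post_status_update("Investigating elevated errors", "investigating",
#                    "We are investigating reports of failed checkouts.")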

On-Call Schedule Best Practices

  • Rotate on-call weekly to prevent burnout (a rotation sketch follows this list)
  • Ensure on-call engineers have access to all necessary systems and credentials
  • Provide a runbook for common issues that on-call may encounter
  • Follow the sun: if your team spans time zones, schedule on-call during waking hours for each region
  • Compensate on-call time fairly. Engineers who carry pagers should be recognized for it
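
Computing the weekly rotation deterministically keeps the schedule predictable and auditable. A minimal sketch; the roster names and epoch date are placeholders:

import datetime

ROSTER = ["alice", "bob", "carol"]  # placeholder names, in rotation order
EPOCH = datetime.date(2026, 1, 5)   # a Monday used as the rotation reference

def on_call(day: datetime.date) -> str:
    """Return who is on call for the week containing `day`."""
    weeks_elapsed = (day - EPOCH).days // 7
    return ROSTER[weeks_elapsed % len(ROSTER)]

print(on_call(datetime.date.today()))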

Testing Your Plan

An untested plan fails when you need it most. Test quarterly at minimum.

Tabletop exercises: Walk through a hypothetical incident as a group. The IC reads the scenario, and team members describe what they would do at each phase.

Simulated incidents: Intentionally inject a controlled failure (in staging, not production) and run through the full response process.

Chaos engineering: For mature teams, use tools like Chaos Monkey to randomly inject failures in production and validate that your monitoring, alerting, and response processes work end-to-end.

A good incident response plan does not prevent incidents. It turns a 4-hour panic into a 30-minute process with clear roles, fast communication, and a path to resolution that everyone understands before the pressure hits.

The First Step Is Fast Detection

Site Watcher monitors uptime, SSL, DNS, domains, and vendor dependencies continuously. Get the alert that triggers your response plan within seconds. $39/mo unlimited, free for 3 targets.