Incident Response Plan Template for Website Downtime
Build a complete incident response plan for website outages. Includes roles, severity levels, response steps, and communication templates.
Last updated: 2026-02-17
Why You Need an Incident Response Plan
When your website goes down at 2 a.m., you do not want your team scrambling to figure out who does what. You want a plan that has already been tested, with clear roles, defined escalation paths, and communication templates ready to send.
Companies without an incident response plan take 2-3x longer to resolve outages. The chaos of an unplanned response causes confusion, duplicated effort, and missed steps that extend downtime and increase costs.
This guide gives you a complete, ready-to-adapt incident response plan for website and web service outages.
The Core Components of an Incident Response Plan
Every effective incident response plan needs five elements:
1. Defined Roles
2. Severity Classification
3. Response Procedures
4. Communication Plan
5. Post-Mortem Process
Incident Response Roles
Incident Commander (IC)
The Incident Commander owns the incident from detection to resolution. This person coordinates the response, makes decisions when the team disagrees, and ensures the process is followed.
Responsibilities:
- Declare the incident and assign severity
- Assemble the response team
- Coordinate investigation and remediation
- Make escalation decisions
- Authorize changes to production systems
- Declare the incident resolved
The IC does not need to be the most technical person. They need to be organized, decisive, and calm under pressure.
Technical Lead
The Technical Lead drives the actual investigation and fix. This is your most capable engineer for the affected system.
Responsibilities:
- Diagnose the root cause
- Implement the fix or workaround
- Validate the fix is working
- Document technical details for the post-mortem
- Advise the IC on estimated resolution time
Communications Lead
The Communications Lead handles all stakeholder communication so the technical team can focus on fixing the problem.
Responsibilities:
- Update the status page
- Send internal notifications to leadership and affected teams
- Draft and send customer communications
- Monitor social media and support channels for impact reports
- Provide regular updates at defined intervals
For small teams, one person may fill multiple roles. That is fine. What matters is that the responsibilities are defined and assigned before an incident happens, not during one.
Severity Levels
Define severity levels based on user impact, not technical complexity. A simple DNS change that takes down your entire site is higher severity than a complex bug that affects one page.
| Severity | Definition | Response Time | Update Frequency |
|---|---|---|---|
| SEV-1: Critical | Complete site outage or data breach affecting all users | 5 minutes | Every 15 minutes |
| SEV-2: Major | Significant functionality broken, large portion of users affected | 15 minutes | Every 30 minutes |
| SEV-3: Minor | Partial functionality degraded, subset of users affected | 1 hour | Every 2 hours |
| SEV-4: Low | Minor issue, minimal user impact, cosmetic or non-critical | 4 hours | Daily until resolved |
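The table above can also be kept as data, so that tooling and people read the same numbers. The following is a minimal sketch in Python, not a prescribed implementation: the `SeverityPolicy` class and the two helper functions are hypothetical names introduced for illustration, and only the durations come from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity table (class name is illustrative)."""
    name: str
    response_time: timedelta    # how quickly someone must respond
    update_interval: timedelta  # how often stakeholders get an update


# Durations copied from the severity table above.
POLICIES = {
    1: SeverityPolicy("SEV-1: Critical", timedelta(minutes=5), timedelta(minutes=15)),
    2: SeverityPolicy("SEV-2: Major", timedelta(minutes=15), timedelta(minutes=30)),
    3: SeverityPolicy("SEV-3: Minor", timedelta(hours=1), timedelta(hours=2)),
    4: SeverityPolicy("SEV-4: Low", timedelta(hours=4), timedelta(days=1)),
}


def response_deadline(severity: int, detected_at: datetime) -> datetime:
    """Latest time by which the first response should have started."""
    return detected_at + POLICIES[severity].response_time


def next_update_due(severity: int, last_update_at: datetime) -> datetime:
    """Time by which the next stakeholder update should go out."""
    return last_update_at + POLICIES[severity].update_interval
```

For example, a SEV-1 detected at 02:00 has a response deadline of 02:05, and each update after that is due 15 minutes after the previous one.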
Severity Decision Framework
When classifying an incident, ask these questions in order:
1. Is the site completely unreachable? If yes, SEV-1.
2. Can users complete their primary task (purchase, sign up, log in)? If not, SEV-1 or SEV-2.
3. How many users are affected? 50% or more = SEV-2. Fewer than 50% = SEV-3.
4. Is there a workaround? If users can accomplish their goal another way, consider lowering severity by one level.
5. Is the issue getting worse? Escalate if the impact is spreading.
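The questions above can be sketched as a small function. This is one possible reading, not the definitive rule set, and it makes three assumptions the framework does not state: a blocked primary task is split into SEV-1 or SEV-2 using the same 50% threshold, exactly 50% counts as the higher severity, and "escalate" means raising severity by one level. The function and parameter names are hypothetical.

```python
def classify_severity(
    site_unreachable: bool,
    primary_task_blocked: bool,
    percent_users_affected: float,
    has_workaround: bool = False,
    getting_worse: bool = False,
) -> int:
    """Return a severity level from 1 (critical) to 4 (low)."""
    # Question 1: a completely unreachable site is always SEV-1.
    if site_unreachable:
        return 1

    majority_affected = percent_users_affected >= 50  # assumption: 50% counts as majority

    if primary_task_blocked:
        # Question 2: SEV-1 or SEV-2 (assumption: decided by share of users).
        level = 1 if majority_affected else 2
    else:
        # Question 3: SEV-2 for a majority of users, otherwise SEV-3.
        level = 2 if majority_affected else 3

    # Question 4: a workaround lowers severity by one level.
    if has_workaround:
        level += 1
    # Question 5: spreading impact raises severity by one level (assumption).
    if getting_worse:
        level -= 1

    # Keep the result inside the SEV-1..SEV-4 range.
    return max(1, min(4, level))
```

Treat the result as a starting point for the Incident Commander, who still assigns the final severity.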
Response Procedures
Phase 1: Detection and Triage (0-5 minutes)
1. Alert Received
2. Acknowledge the Alert
3. Initial Assessment
4. Classify Severity
5. Declare the Incident
Phase 2: Investigation and Diagnosis (5-30 minutes)
1. Assemble the Response Team
2. Gather Context
3. Check Recent Changes
4. Identify the Root Cause
5. Communicate Status
Phase 3: Remediation (15 minutes - several hours)
1. Decide on Approach
2. Implement the Fix
3. Validate the Fix
4. Monitor for Recurrence
5. Declare Resolution
Phase 4: Post-Mortem (Within 48 hours)
1. Write the Post-Mortem Report
2. Hold the Post-Mortem Meeting
3. Identify Action Items
4. Track Action Items to Completion
Communication Templates
Status Page: Investigating
[Service Name] - Investigating Reports of [Issue Description]
We are currently investigating reports of [brief description of the issue]. Our team is actively working on this. We will provide an update within [timeframe based on severity level].
Impact: [Description of what users are experiencing]
Started: [Time in UTC]
Status Page: Identified
[Service Name] - Issue Identified
We have identified the cause of [brief description]. [One sentence on what the cause is, if appropriate to share]. Our team is implementing a fix now. We expect to have this resolved within [estimated time].
Impact: [Updated description of what users are experiencing]
Status Page: Resolved
[Service Name] - Resolved
The issue affecting [brief description] has been resolved. All systems are operating normally.
Duration: [Start time] to [End time] ([total duration])
Root cause: [Brief, non-technical explanation]
We apologize for any inconvenience. We are implementing measures to prevent this from recurring.
Internal Notification: SEV-1
INCIDENT: SEV-1 - [Brief Description]
Status: Active
IC: [Name]
Started: [Time]
Impact: [What is affected and how many users]
Incident Channel: [Link]
We are actively investigating. Updates will be provided every 15 minutes.
Never share technical details about security incidents in public communications. For security-related outages, keep public updates focused on impact and resolution status only.
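The bracketed placeholders in the templates above are easy to leave unfilled under pressure. As a minimal sketch, assuming templates are stored as plain strings with `[Placeholder]` markers, the hypothetical `fill_template` helper below substitutes values and refuses to produce a message that still has gaps. The service name and values in the usage lines are made up for illustration.

```python
import re
from typing import Mapping

# Matches [Placeholder] markers such as [Service Name] or [Time in UTC].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template: str, values: Mapping[str, str]) -> str:
    """Replace every [Placeholder] with its value; fail if any is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in values})
    if missing:
        raise ValueError("unfilled placeholders: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


# Shortened version of the "Investigating" status page template.
INVESTIGATING = (
    "[Service Name] - Investigating Reports of [Issue Description]\n"
    "Started: [Time in UTC]"
)

message = fill_template(
    INVESTIGATING,
    {
        "Service Name": "Checkout API",          # hypothetical service
        "Issue Description": "failed payments",  # hypothetical issue
        "Time in UTC": "02:14 UTC",
    },
)
```

Raising an error on a missing value is a deliberate choice in this sketch: a status update that still reads "[Service Name]" is worse than a short delay while the Communications Lead fills it in.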
Post-Mortem Template
Use this structure for every post-mortem document:
1. Summary - One paragraph describing what happened, when, and the impact.
2. Timeline - Chronological list of events from first detection to resolution. Include timestamps.
3. Root Cause - Technical explanation of what caused the incident.
4. Impact - Quantified: duration, number of users affected, revenue lost, SLA impact.
5. What Went Well - Things that worked during the response.
6. What Went Wrong - Things that did not work or made the situation worse.
7. Action Items - Specific, assigned, time-bound tasks to prevent recurrence.
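Action items are the part of a post-mortem most likely to go stale, so it can help to check that each one really is assigned and time-bound before the document is closed. The sketch below is one way to do that; the `ActionItem` class and `incomplete_action_items` function are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # who is responsible
    due: Optional[date] = None   # when it must be done


def incomplete_action_items(items: List[ActionItem]) -> List[str]:
    """Return one message per action item that lacks an owner or a due date."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"{item.description}: no owner")
        if item.due is None:
            problems.append(f"{item.description}: no due date")
    return problems
```

An empty result means every item has both an owner and a deadline. Whether each item is specific enough still needs a human reviewer.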
Integrating Monitoring into Your Response Plan
Your incident response plan is only as fast as your detection. Manual detection (waiting for customer reports) typically adds 30-60 minutes to incident duration. Automated monitoring reduces detection to seconds.
- Multi-Channel Alerting
- Context-Rich Alerts
- Escalation Rules
- Status Page Integration
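To make the idea of escalation rules concrete, here is a minimal sketch of one common pattern: if an alert is not acknowledged, page the next person in a chain after a fixed interval. The 5-minute step, the chain contents, and the function name `who_to_page` are all assumptions chosen for illustration; real monitoring and paging tools configure this in their own way.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def who_to_page(
    chain: List[str],
    alerted_at: datetime,
    now: datetime,
    acknowledged: bool,
    step: timedelta = timedelta(minutes=5),  # assumed escalation interval
) -> Optional[str]:
    """Return the contact to page right now, or None once the alert is acknowledged."""
    if acknowledged:
        return None
    if not chain:
        raise ValueError("escalation chain must not be empty")
    # Number of full escalation intervals that have passed since the alert fired.
    elapsed_steps = max(0, int((now - alerted_at) / step))
    # Stay on the last contact once the chain is exhausted.
    return chain[min(elapsed_steps, len(chain) - 1)]
```

With a chain of primary on-call, secondary on-call, and engineering manager, an unacknowledged alert reaches the secondary after 5 minutes and the manager after 10.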
On-Call Schedule Best Practices
- Rotate on-call weekly to prevent burnout
- Ensure on-call engineers have access to all necessary systems and credentials
- Provide a runbook for common issues that on-call may encounter
- Follow the sun: if your team spans time zones, schedule on-call during waking hours for each region
- Compensate on-call time fairly. Engineers who carry pagers should be recognized for it
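A weekly rotation like the one recommended above can be computed rather than maintained by hand. The following sketch assumes a single rotation with a fixed start date and equal one-week shifts; the function name, engineer names, and start date are hypothetical, and it does not handle swaps, holidays, or follow-the-sun regions.

```python
from datetime import date
from typing import List


def on_call_for(engineers: List[str], rotation_start: date, day: date) -> str:
    """Return who is on call on `day`, rotating through `engineers` weekly."""
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    # Whole weeks elapsed since the rotation began.
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


rotation = ["alice", "bob", "carol"]   # hypothetical team
start = date(2026, 1, 5)               # a Monday, chosen for illustration
current = on_call_for(rotation, start, date(2026, 1, 14))
```

For a follow-the-sun setup, one option is to run a separate rotation like this per region and pick the region by the current hour.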
Testing Your Plan
An untested plan fails when you need it most. Test quarterly at minimum.
Tabletop exercises: Walk through a hypothetical incident as a group. The IC reads the scenario, and team members describe what they would do at each phase.
Simulated incidents: Intentionally inject a controlled failure (in staging, not production) and run through the full response process.
Chaos engineering: For mature teams, use tools like Chaos Monkey to randomly inject failures in production and validate that your monitoring, alerting, and response processes work end-to-end.
A good incident response plan does not prevent incidents. It turns a 4-hour panic into a 30-minute process with clear roles, fast communication, and a path to resolution that everyone understands before the pressure hits.
The First Step Is Fast Detection
Site Watcher monitors uptime, SSL, DNS, domains, and vendor dependencies continuously. Get the alert that triggers your response plan within seconds. $39/mo unlimited, free for 3 targets.