On-Call Best Practices for Website Teams

Why On-Call Matters for Website Teams

On-call is the practice of having a designated person available to respond to incidents outside of normal working hours. For website teams, this means someone is responsible for responding when your monitoring alerts fire at 2am because the site is down, an SSL certificate expired, or a critical vendor outage is affecting your application.

Without on-call, incidents that happen outside business hours go unaddressed until someone happens to notice. For an e-commerce site, that could mean an entire night of lost sales. For a SaaS product, it could mean breached SLA commitments and angry customers who discovered your outage before you did.

On-call does not need to be painful. Done well, it is a manageable responsibility that keeps your systems healthy and your team confident. Done poorly, it burns people out, floods them with meaningless alerts, and still fails to catch the things that matter.

Setting Up On-Call Rotations

The foundation of sustainable on-call is a fair, predictable rotation schedule.

Rotation Length

Weekly rotations are the most common and generally strike the best balance. Daily rotations create too many handoffs and too much context-switching. Two-week rotations leave people on-call too long, which leads to fatigue and resentment.

For small teams (2-4 people), weekly rotations mean each person is on-call every 2-4 weeks. That is manageable. If your team is too small for a rotation, that is a staffing problem, not a scheduling problem. Nobody should be on-call every week indefinitely.

Rotation Timing

Start and end rotations during business hours, typically at 10am on a weekday. This gives the outgoing person the morning to handle any lingering issues and gives the incoming person the day to settle in, review recent incidents, and verify their alerting setup works.

Never start rotations on Monday morning. If something happened over the weekend, the outgoing person is the one with context. Let them finish their shift during business hours, then hand off once any weekend issues are wrapped up and the handoff is clean.
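As a rough sketch of the rotation described above, the following generates one-week shifts that hand off at 10am on a weekday. The team names, the choice of Tuesday, and the function name are illustrative assumptions, not the behavior of any particular scheduling tool.

```python
from datetime import datetime, timedelta

def weekly_rotation(team, first_handoff, weeks):
    """Return (start, end, person) tuples for consecutive one-week shifts."""
    if first_handoff.weekday() >= 5:
        raise ValueError("hand off on a weekday, during business hours")
    shifts = []
    for i in range(weeks):
        start = first_handoff + timedelta(weeks=i)
        # Cycle through the team so each person is on-call every len(team) weeks.
        shifts.append((start, start + timedelta(weeks=1), team[i % len(team)]))
    return shifts

# Hypothetical three-person team; handoff every Tuesday at 10:00.
team = ["alice", "bob", "carol"]
schedule = weekly_rotation(team, datetime(2024, 1, 2, 10, 0), weeks=6)
for start, end, person in schedule:
    print(f"{start:%a %Y-%m-%d %H:%M} -> {end:%a %Y-%m-%d %H:%M}: {person}")
```

With three people, each person is on-call every third week, and every shift ends exactly where the next one begins.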

Coverage Model

Primary + secondary. The primary on-call person is the first responder. The secondary is the backup if the primary does not respond within a defined window (typically 15 minutes). This redundancy is critical. People sleep through phone alarms, lose phone signal, or have unexpected personal situations.

Follow-the-sun. If your team spans time zones, rotate on-call so the person on duty is always in their normal waking hours. A team with members in US Eastern and European time zones can split 24-hour coverage without anyone getting 3am pages. This is the ideal model when team geography allows it.
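To sanity-check a two-region split like the one above, you can walk through a day hour by hour and look at the local time of whoever is on duty. This sketch assumes fixed UTC offsets (standard time, daylight saving ignored) and an assumed handoff at 05:00 and 17:00 UTC; the region names and both functions are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed UTC offsets (standard time; daylight saving ignored for simplicity).
OFFSETS = {"europe": 1, "us_east": -5}

def on_duty_region(now_utc):
    """Europe covers 05:00-17:00 UTC; US Eastern covers the other 12 hours."""
    return "europe" if 5 <= now_utc.hour < 17 else "us_east"

def local_hour(now_utc, region):
    """Hour of day in the on-duty region's local time."""
    return now_utc.astimezone(timezone(timedelta(hours=OFFSETS[region]))).hour

# Check every hour of one day: who is on duty, and what time is it for them?
day = datetime(2024, 1, 10, tzinfo=timezone.utc)
coverage = []
for h in range(24):
    t = day + timedelta(hours=h)
    region = on_duty_region(t)
    coverage.append((h, region, local_hour(t, region)))

for utc_hour, region, hour in coverage:
    print(f"{utc_hour:02d}:00 UTC -> {region} (local {hour:02d}:00)")
```

Under these assumed windows, the on-duty person's local time always falls between 06:00 and midnight. The US Eastern shift does run late into the evening, so a third region would be needed to keep everyone within ordinary working hours.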

Scheduling Tools

Use a dedicated scheduling tool rather than a shared calendar or spreadsheet. Tools like PagerDuty, Opsgenie, and Grafana OnCall manage rotations, handle overrides (for example, when someone needs to swap a shift), and integrate with your alerting systems. The scheduling tool should be the source of truth for who is on-call right now.

Escalation Policies

An escalation policy defines what happens when the on-call person does not respond. Without escalation, a missed alert means an unresolved incident.

Tiered Escalation

Tier 1: Primary on-call. Alert fires. The primary on-call person gets notified via their preferred urgent channel (phone call, SMS, push notification). They have 15 minutes to acknowledge the alert.

Tier 2: Secondary on-call. If the primary does not acknowledge within 15 minutes, the secondary on-call person is notified automatically. They have 15 minutes to respond.

Tier 3: Team lead or manager. If neither the primary nor secondary responds within 30 minutes total, the alert escalates to the team lead or engineering manager. At this point, the goal is not just to fix the incident but to figure out why nobody is responding.
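The three tiers above can be expressed as a small table of delays. This is a minimal sketch of the logic only; real alerting tools have their own policy formats, and the `Tier` class and `notified_tiers` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    notify_after_min: int  # minutes since the alert fired, if still unacknowledged

# Timings taken from the tiers described above.
POLICY = [
    Tier("primary on-call", 0),
    Tier("secondary on-call", 15),
    Tier("team lead or manager", 30),
]

def notified_tiers(minutes_unacknowledged, policy=POLICY):
    """Return everyone who should have been notified by now."""
    return [t.name for t in policy if minutes_unacknowledged >= t.notify_after_min]

print(notified_tiers(5))
print(notified_tiers(20))
print(notified_tiers(31))
```

Keeping the policy as data rather than as nested conditions makes it easy to review the timings and to add a tier later.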

Auto-Escalation

Configure your alerting system to escalate automatically. Do not rely on the on-call person to manually escalate when they need help. Start with less intrusive channels (push notification) and escalate to more intrusive ones (SMS, phone call). The system should bring in additional people based on predefined rules without requiring manual action.

Preventing Alert Fatigue

Alert fatigue is the number one reason on-call fails. When the on-call person gets 15 alerts a day, they stop reading them carefully. When most alerts are false positives or low-priority noise, the one critical alert gets lost in the shuffle.

Only Alert on Actionable Conditions

Every alert should require a human to do something. If the correct response to an alert is "wait and see if it resolves itself," that is not an alert. It is a log entry. Review every alert in your system and ask: "If this fires at 3am, does someone need to wake up and act?" If the answer is no, it should not page the on-call person.

Set Appropriate Thresholds

A single failed health check should not trigger an alert. Transient network issues cause brief blips constantly. Require multiple consecutive failures before alerting. For uptime monitoring, two or three consecutive failed checks from different locations is a reasonable threshold.
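One way to implement the consecutive-failure rule is to keep a short window of the latest check results. The sketch below is illustrative: the class name, the default of three checks, and the location labels are assumptions, and most monitoring tools expose this as a setting rather than code.

```python
from collections import deque

class ConsecutiveFailureDetector:
    """Alert only after `threshold` failed checks in a row from 2+ locations."""

    def __init__(self, threshold=3, min_locations=2):
        self.threshold = threshold
        self.min_locations = min_locations
        self.recent = deque(maxlen=threshold)  # (location, passed) of the latest checks

    def record(self, location, ok):
        """Record one check result; return True if an alert should fire."""
        self.recent.append((location, ok))
        if len(self.recent) < self.threshold:
            return False
        all_failed = all(not passed for _, passed in self.recent)
        locations = {loc for loc, _ in self.recent}
        return all_failed and len(locations) >= self.min_locations

detector = ConsecutiveFailureDetector(threshold=3, min_locations=2)
results = [
    detector.record("us-east", ok=False),   # single blip: no alert
    detector.record("eu-west", ok=True),    # recovered
    detector.record("us-east", ok=False),
    detector.record("eu-west", ok=False),
    detector.record("ap-south", ok=False),  # third failure in a row
]
print(results)  # [False, False, False, False, True]
```

The early blip never alerts because the successful check in between resets the streak. Only the final run of three failures from different locations does.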

Response time alerts need sensible baselines. If your site normally responds in 200ms, alerting at 201ms creates noise. Alert when response times exceed a level that actually affects user experience, typically 2-3x your normal baseline.
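A baseline-relative threshold can be sketched in a few lines. Using the median of recent samples, so that one slow request does not trigger the alert, is an assumption of this example rather than something the guidance above prescribes; the function name and sample values are hypothetical.

```python
from statistics import median

def latency_alert(recent_ms, baseline_ms, multiplier=2.0):
    """Alert when the median of recent response times exceeds multiplier x baseline."""
    return median(recent_ms) > baseline_ms * multiplier

# Site normally responds in about 200ms.
print(latency_alert([201, 210, 195], baseline_ms=200))  # False: normal variation
print(latency_alert([650, 700, 480], baseline_ms=200))  # True: median 650ms > 400ms
```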

Separate Urgency Levels

Not everything that needs attention needs attention at 3am. Create clear severity levels:

Critical. Pages the on-call person at any hour: site is down, data is at risk, revenue is being lost.

Warning. Notifies during business hours: degraded performance, certificate expiring in 14 days.

Informational. Goes to dashboards and logs only, never to a person.
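The severity levels above amount to a small routing rule. In this sketch, business hours are assumed to be 09:00 to 17:00, Monday to Friday, and the returned action names are placeholders for whatever your alerting tool actually does.

```python
from datetime import datetime

def route(severity, now):
    """Decide what to do with an alert based on its severity and the local time."""
    business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    if severity == "critical":
        return "page"  # any hour, any day
    if severity == "warning":
        return "notify" if business_hours else "hold_until_business_hours"
    return "dashboard_only"  # informational never reaches a person

print(route("critical", datetime(2024, 1, 6, 3, 0)))   # Saturday 03:00
print(route("warning", datetime(2024, 1, 3, 3, 0)))    # Wednesday 03:00
print(route("warning", datetime(2024, 1, 3, 11, 0)))   # Wednesday 11:00
```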

Regular Alert Review

Once a month, review all alerts that fired in the previous period. Categorize them: true positive and actionable, true positive but not actionable, false positive. Any alert that consistently falls into the "not actionable" or "false positive" category should be tuned or removed.
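If you export the month's alerts with the category assigned during review, finding the candidates for tuning is a simple tally. The cutoffs below (at least three firings, more than half of them noise) are assumed values for illustration, as are the alert names.

```python
from collections import Counter, defaultdict

NOISE = {"not_actionable", "false_positive"}

def noisy_alerts(fired, min_fires=3, noise_ratio=0.5):
    """Return alert names whose firings were mostly noise during the review period.

    `fired` is a list of (alert_name, category) pairs from the manual review.
    """
    by_alert = defaultdict(Counter)
    for name, category in fired:
        by_alert[name][category] += 1
    flagged = []
    for name, counts in by_alert.items():
        total = sum(counts.values())
        noise = sum(counts[c] for c in NOISE)
        if total >= min_fires and noise / total > noise_ratio:
            flagged.append(name)
    return sorted(flagged)

review = [
    ("disk_usage_80", "not_actionable"),
    ("disk_usage_80", "not_actionable"),
    ("disk_usage_80", "actionable"),
    ("disk_usage_80", "false_positive"),
    ("site_down", "actionable"),
    ("site_down", "actionable"),
    ("site_down", "actionable"),
]
print(noisy_alerts(review))  # ['disk_usage_80']
```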

For guidance on setting up the monitoring that feeds your on-call alerts, see our website monitoring checklist.

Runbooks

A runbook is a documented procedure for responding to a specific type of alert. Runbooks transform on-call from "figure it out under pressure" to "follow the documented steps."

What a Runbook Should Include

Alert description. What this alert means in plain language. "This alert fires when the primary site returns 5xx errors for 3 consecutive checks from 2 or more locations."

Severity and expected response time. How urgent this is and how quickly the on-call person should respond.

Diagnosis steps. A step-by-step process to determine what is causing the alert. Check the status page of your hosting provider. Verify DNS resolution. Test the site from a different network. Check recent deployments.

Resolution steps. For known failure modes, document the fix. "If the issue is caused by a recent deployment, roll back to the previous version using [specific command]." "If the hosting provider is experiencing an outage, update the status page and notify customers per the [communication template]."

Escalation criteria. When to call for help and who to contact. "If the issue is not resolved within 30 minutes, escalate to the secondary on-call. If the issue involves data loss, immediately escalate to the database team lead."
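If you keep runbooks in a Git repo, a small structure mirroring the sections above makes it easy to spot incomplete ones. This is a sketch with hypothetical field names; the 15-minute response time is an example value, and the resolution steps are left empty on purpose to show the completeness check.

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    alert_description: str
    severity: str
    respond_within_min: int
    diagnosis_steps: list = field(default_factory=list)
    resolution_steps: list = field(default_factory=list)
    escalation_criteria: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections that are still empty, for a quick completeness check."""
        return [name for name, value in vars(self).items() if not value]

site_5xx = Runbook(
    alert_description="Primary site returns 5xx errors for 3 consecutive checks from 2 or more locations.",
    severity="critical",
    respond_within_min=15,
    diagnosis_steps=[
        "Check the hosting provider's status page.",
        "Verify DNS resolution.",
        "Test the site from a different network.",
        "Check recent deployments.",
    ],
    escalation_criteria=["Not resolved within 30 minutes: escalate to the secondary on-call."],
)
print(site_5xx.missing_sections())  # ['resolution_steps']
```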

Keep Runbooks Accessible

Runbooks should be accessible from the alert itself. The best alerting tools let you link a runbook URL directly to each alert rule. When the on-call person gets paged, the runbook is one click away. Do not bury runbooks in a wiki that requires VPN access from a specific browser.

Our incident response plan template provides a framework for building out your runbooks and response procedures.

Handoff Practices

The transition between on-call shifts is a vulnerability. If the outgoing person has context about an ongoing issue that the incoming person does not know about, response times suffer.

The Handoff Checklist

At the end of every on-call shift, the outgoing person should post a short written summary in a dedicated Slack channel covering: active incidents still being monitored, recent incidents (even resolved ones, for context if they recur), known risks like scheduled maintenance or flaky vendors, and any temporary configuration changes that need reverting.

Five bullet points take two minutes to write and can save the incoming person significant time. For vendor outages that span multiple shifts, Is That Down has a useful vendor outage response playbook that includes handoff protocols.
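A fixed template helps keep handoff notes consistent from shift to shift. The sketch below only builds the text; posting it to your chat tool is left out, and the names and incidents in the example are made up.

```python
def handoff_summary(outgoing, incoming, active, recent, risks, temp_changes):
    """Build a plain-text handoff note from the four checklist sections."""
    sections = [
        ("Active incidents", active),
        ("Recent incidents", recent),
        ("Known risks", risks),
        ("Temporary changes to revert", temp_changes),
    ]
    lines = [f"On-call handoff: {outgoing} -> {incoming}"]
    for title, items in sections:
        lines.append(f"{title}:")
        # Say "none" explicitly so the reader knows the section was considered.
        lines.extend(f"  - {item}" for item in (items or ["none"]))
    return "\n".join(lines)

note = handoff_summary(
    outgoing="alice",
    incoming="bob",
    active=["Checkout latency elevated since 08:30, still monitoring"],
    recent=["CDN vendor outage on Tuesday, resolved"],
    risks=["Database maintenance window Thursday 22:00"],
    temp_changes=[],
)
print(note)
```

Writing "none" for an empty section is a deliberate choice: it tells the incoming person the section was checked rather than forgotten.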

Compensation

On-call is real work. It restricts your personal time, disrupts your sleep, and adds stress. Compensating fairly for on-call duty is both the ethical thing to do and a practical necessity for retention.

Common Compensation Models

Flat stipend per on-call shift. A fixed amount ($200-500 per week) for being available, regardless of whether any pages fire. This compensates for the restriction on personal time.

Per-incident bonus. On top of a base stipend, a bonus for each incident responded to outside business hours. Getting woken up at 3am is worse than just being available, and compensation should reflect that.

Time off in lieu. If you get paged at 3am and spend two hours on an incident, you get that time back during the following work week. This prevents exhausted engineers from dragging themselves through their regular workday.

The most sustainable model combines a base stipend with comp time. Expecting on-call as an unpaid duty breeds resentment and creates a perverse incentive to avoid being the person who knows how to fix things.

Tools

A functional on-call setup requires a few categories of tools working together.

Monitoring. The source of alerts. Site Watcher covers website uptime, SSL, domain, and DNS monitoring. For a comparison of options, see our monitoring tools comparison. Understanding common causes of website downtime helps you configure alerts for the right failure modes.

Alerting and scheduling. PagerDuty, Opsgenie, or Grafana OnCall for managing rotations, escalation policies, and alert routing.

Communication and documentation. Slack for incident coordination, plus a wiki or Git repo for runbooks and post-mortem records. A public status page (Statuspage, Instatus) for customer communication during outages.

The best on-call setup is one where pages are rare and actionable. If your on-call person gets paged more than twice a week, the problem is not the on-call rotation. It is your system reliability or your alert configuration. Fix the root cause rather than adding more people to the rotation.
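To see whether you are over the twice-a-week mark, you can tally page timestamps by calendar week. This sketch assumes you can export page times from your alerting tool as `datetime` values; the function name and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime

def weeks_over_budget(page_times, max_pages_per_week=2):
    """Return {(iso_year, iso_week): count} for weeks with too many pages."""
    per_week = Counter(t.isocalendar()[:2] for t in page_times)
    return {week: n for week, n in per_week.items() if n > max_pages_per_week}

pages = [
    datetime(2024, 1, 8, 2, 14),
    datetime(2024, 1, 9, 23, 40),
    datetime(2024, 1, 11, 4, 5),
    datetime(2024, 1, 16, 1, 30),
]
print(weeks_over_budget(pages))  # {(2024, 2): 3}
```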

Building a Sustainable On-Call Culture

On-call works when the team sees it as a shared responsibility with real support behind it. That means fair rotations, reasonable alert volume, good runbooks, proper compensation, and management that treats on-call burden as a first-class concern.

Review your on-call practices quarterly. Ask the team: What is working? What is not? Which alerts are noisy? Which runbooks are outdated? Which incidents could have been prevented?

The goal is not zero incidents. The goal is that when incidents happen, the right person is notified quickly, has the tools and documentation to respond effectively, and is fairly compensated for their time. Everything else builds on that foundation.

For a complete framework on maintaining your website and preventing the incidents that drive on-call pages, start with our website maintenance and monitoring guide.
