What Is On-Call?

On-call is the practice of designating engineers, on a rotating schedule, to respond to alerts and incidents at any hour, including nights and weekends. It is the human side of fast recovery: monitoring detects a problem, but the on-call engineer is the one who acknowledges the alert and starts fixing it.

Automated monitoring can detect an outage within a minute, but detection alone restores nothing. On-call closes that gap by guaranteeing there is always a named person responsible for responding, even at 3 a.m. When a check fails, an alert is routed to whoever is currently on call; they acknowledge it, diagnose the cause, and either fix it or escalate. Without an on-call rotation, an outage that starts overnight may sit unaddressed until the next morning, turning a 10-minute fix into hours of downtime.

How on-call rotations work

On-call is organised as a schedule that hands the responsibility around a team so no single person carries it permanently. A weekly rotation is the most common pattern: each engineer is primary on-call for a week, then rotates off. Most setups layer in a few standard mechanisms:

  • Primary and secondary: the primary is paged first, and a backup (the secondary) is paged if the primary does not acknowledge within a set time.
  • Escalation policy: if an alert is unacknowledged after, say, 5 minutes, it escalates to the secondary, then to a team lead or manager.
  • Follow-the-sun: for global teams, rotations hand off across time zones so no one is paged in the middle of the night.
  • Severity tiers: only high-severity alerts page a human out of hours; low-priority issues wait for business hours.
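
The mechanisms above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular paging tool works: the engineer names, the rotation start date, the "team-lead" target, and the choice of next week's primary as this week's secondary are all hypothetical, and the 5-minute step is taken from the "say, 5 minutes" example in the list.

```python
from datetime import date, timedelta

# Hypothetical team and policy values, chosen only for illustration.
ENGINEERS = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)  # a Monday; week 0 of the rotation
ESCALATION_STEP_MINUTES = 5        # unacknowledged time before each escalation


def primary_on_call(day: date) -> str:
    """Weekly rotation: each engineer is primary for one week, then rotates off."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


def secondary_on_call(day: date) -> str:
    """Assumption: next week's primary acts as this week's backup."""
    return primary_on_call(day + timedelta(weeks=1))


def who_is_paged(day: date, minutes_unacknowledged: int) -> str:
    """Escalation policy: primary, then secondary, then a team lead."""
    chain = [primary_on_call(day), secondary_on_call(day), "team-lead"]
    step = minutes_unacknowledged // ESCALATION_STEP_MINUTES
    # Stay on the last link of the chain once it has been reached.
    return chain[min(step, len(chain) - 1)]


def pages_a_human_now(severity: str, business_hours: bool) -> bool:
    """Severity tiers: out of hours, only high-severity alerts page someone."""
    return severity == "high" or business_hours
```

With these values, `who_is_paged(date(2024, 1, 1), 0)` returns the primary, the same call with `5` returns the secondary, and anything from `10` minutes upward returns the team lead. A follow-the-sun setup would swap `primary_on_call` for a lookup keyed on time of day as well as date.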

Why on-call matters for recovery

On-call directly shapes MTTR (mean time to recovery), the average time it takes to restore service after a failure. MTTR breaks into stages: detect, alert, acknowledge, diagnose, fix, and verify. A well-run on-call rotation compresses the alert and acknowledge stages, which is often where minutes leak. If monitoring detects an outage in 1 minute but nobody is paged for 30, the acknowledgement delay dominates total recovery time. Availability follows Availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures, so shaving recovery time lifts the same headline uptime percentage that targets and SLAs are written against. The margins are tight: 99.9% uptime allows only about 43 minutes of downtime in a 30-day month, so a single slow overnight response can exhaust most of the budget. For a full breakdown of those thresholds, see what 99.9% uptime actually means.
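
The arithmetic behind those figures is short enough to check directly. The sketch below assumes a 30-day month and uses minutes as the unit throughout; the function names and the sample MTBF and MTTR values are illustrative, not measurements.

```python
def availability(mtbf_minutes: float, mttr_minutes: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


def downtime_budget_minutes(target: float, days: int = 30) -> float:
    """Minutes of downtime a given uptime target allows over `days` days."""
    return (1 - target) * days * 24 * 60


# 99.9% over a 30-day month leaves roughly 43 minutes of downtime.
budget = round(downtime_budget_minutes(0.999), 1)  # 43.2

# Hypothetical service that fails about once every 30 days.
mtbf = 30 * 24 * 60
fast = availability(mtbf, mttr_minutes=10)  # paged and fixed promptly
slow = availability(mtbf, mttr_minutes=40)  # same fix after a 30-minute paging delay
```

In this example the 40-minute recovery alone uses nearly all of the 43.2-minute monthly budget, while the 10-minute recovery leaves most of it intact.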

On-call also feeds the rest of the incident lifecycle. The on-call engineer typically updates the status page so customers know the team is responding, and after the incident is resolved they help write the postmortem that captures what happened and how to prevent a repeat. Sustainable on-call depends on good alerts. Too many false pages cause alert fatigue, where engineers start ignoring notifications, so tuning thresholds to page only on real problems is as important as the rotation itself. Reliable external alerting is the foundation, which is why teams pair on-call with continuous monitoring; see how Pulsetic supports this in monitoring for DevOps teams.
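
One common way to tune a threshold is to page only after several consecutive failed checks, so a single transient blip does not wake anyone. The sketch below is one possible approach under that assumption, not a description of how any particular monitoring product decides to alert; the class name and the default of three failures are hypothetical.

```python
class ConsecutiveFailureGate:
    """Signal a page only when `threshold` checks in a row have failed."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True if this result should page."""
        if check_passed:
            # Any success resets the streak, so isolated failures never page.
            self.failures = 0
            return False
        self.failures += 1
        # Page exactly once, at the moment the streak reaches the threshold.
        return self.failures == self.threshold


gate = ConsecutiveFailureGate(threshold=3)
results = [False, True, False, False, False, False]
pages = [gate.record(r) for r in results]
```

The trade-off is that a higher threshold delays the page by the extra check intervals, which adds to the detect stage of MTTR, so the value is a balance between fewer false pages and slower alerts.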

See also: Monitoring for DevOps teams

Frequently asked questions

  • What does on-call mean in software engineering?

    It means an engineer is designated, usually on a rotating schedule, to respond to alerts and incidents whenever they occur, including nights and weekends. The on-call person acknowledges the alert, diagnoses the problem, and either resolves it or escalates to someone who can.

  • How does on-call affect MTTR and uptime?

    On-call compresses the time between an alert firing and a human starting to fix the problem, which lowers MTTR (mean time to recovery). Since availability is MTBF / (MTBF + MTTR), where MTBF is the mean time between failures, faster acknowledgement improves uptime. With a 99.9% target allowing only about 43 minutes of downtime a month, a slow overnight response can consume most of that budget.

  • What is an escalation policy?

    An escalation policy defines what happens when an alert is not acknowledged in time. Typically the primary on-call is paged first; if they do not respond within a set window, the alert escalates to a secondary on-call, then to a team lead or manager, so an incident is not left unattended just because one person missed a page.

  • What is alert fatigue?

    Alert fatigue is when on-call engineers receive so many alerts, especially false or low-priority ones, that they start to ignore or miss them. It is a common failure mode of on-call. The fix is tuning monitoring so only real, actionable problems page a human, and routing the rest to business-hours queues.