What Is an Incident?

An incident is an unplanned event that disrupts or degrades a service below its expected level, from a full outage to slow or partial failures. It is managed through a defined lifecycle of detection, response, mitigation, and resolution, and its cumulative duration is what counts as downtime against an uptime target.

An incident is the unit of reliability work that everything else in operations is organized around. It begins the moment a service stops meeting its expected behavior (a checkout that fails, an API returning 500s, a page that times out) and ends when normal operation is restored and confirmed. Not every alert is an incident and not every incident is a full outage: a partial slowdown that breaches a latency threshold counts, while a single transient blip that self-heals usually does not. Teams classify incidents by severity (often Sev1 through Sev3 or P1 through P4) so the response matches the impact, with the most severe reserved for a customer-facing outage or data loss.
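One common way to separate a self-healing blip from a real incident is to require several consecutive failed checks before opening one. The sketch below is illustrative only: the function name `is_incident` and the default of three consecutive failures are assumptions, not values taken from any particular monitoring tool.

```python
def is_incident(checks, min_consecutive_failures=3):
    """Return True if the check history contains a long enough failure streak.

    checks: list of booleans, oldest first; True means the check passed.
    min_consecutive_failures: hypothetical threshold, tune per service.
    """
    streak = 0
    for passed in checks:
        # A passing check resets the streak; a failing one extends it.
        streak = 0 if passed else streak + 1
        if streak >= min_consecutive_failures:
            return True
    return False


# A single failed check that recovers is treated as a blip.
print(is_incident([True, False, True, True]))    # False
# Three failures in a row cross the threshold.
print(is_incident([True, False, False, False]))  # True
```

A higher threshold reduces false alarms but delays detection, so the value is a trade-off rather than a fixed rule.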

The incident lifecycle

An incident moves through a predictable sequence, and shortening any stage reduces total impact. The stages map closely to the live updates a status page publishes during an event:

  • Detection: a monitoring check or alert flags the problem. Checking from outside your infrastructure, the way a visitor connects, catches issues that a server reporting itself healthy would miss.
  • Response: the alert routes to whoever is on call, who acknowledges and begins investigating.
  • Mitigation: a stopgap (rollback, failover, traffic shedding) restores service before the root cause is fully understood.
  • Resolution: the underlying cause is fixed and normal operation is verified, closing the incident.
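The four stages above can be modeled as a small forward-only state machine that records when each stage was entered. This is a minimal sketch under stated assumptions: the `Incident` class, the `Stage` names, and the rule that stages may be skipped but never revisited are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class Stage(Enum):
    DETECTED = 1
    RESPONDING = 2
    MITIGATED = 3
    RESOLVED = 4


@dataclass
class Incident:
    title: str
    detected_at: datetime
    stage: Stage = Stage.DETECTED
    # Maps each stage to the time it was entered.
    timeline: dict = field(default_factory=dict)

    def __post_init__(self):
        self.timeline[Stage.DETECTED] = self.detected_at

    def advance(self, to: Stage, at: datetime) -> None:
        # Stages only move forward; skipping one (e.g. straight to
        # RESOLVED) is allowed in this sketch.
        if to.value <= self.stage.value:
            raise ValueError(f"cannot move from {self.stage.name} to {to.name}")
        if at < self.timeline[self.stage]:
            raise ValueError("timestamps must not go backwards")
        self.stage = to
        self.timeline[to] = at

    def time_to_resolve(self) -> timedelta:
        # Elapsed time from detection to verified resolution.
        if Stage.RESOLVED not in self.timeline:
            raise ValueError("incident is still open")
        return self.timeline[Stage.RESOLVED] - self.timeline[Stage.DETECTED]


t0 = datetime(2024, 1, 1, 12, 0)
inc = Incident("checkout returns 500s", detected_at=t0)
inc.advance(Stage.RESPONDING, t0 + timedelta(minutes=3))
inc.advance(Stage.MITIGATED, t0 + timedelta(minutes=12))
inc.advance(Stage.RESOLVED, t0 + timedelta(minutes=30))
print(inc.time_to_resolve())  # 0:30:00
```

Keeping a timestamp per stage makes it possible to see which stage dominated the total, which is where shortening effort pays off.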

The clock that MTTR (mean time to recovery) averages across incidents runs from the start of the failure to verified resolution, so it includes any time the problem went undetected. MTTR feeds directly into availability via Availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. Because nothing can be fixed until the problem is known, faster external alerting often lowers MTTR more than a faster fix does.
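As a worked illustration of the availability formula, the sketch below averages a few incident durations into an MTTR and combines it with an MTBF. The function names and the sample numbers (three incidents, a 10-day MTBF) are made up for the example.

```python
from datetime import timedelta


def mean_time_to_recovery(durations):
    """Average a list of incident durations (timedelta objects)."""
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


def availability(mtbf, mttr):
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction."""
    return mtbf / (mtbf + mttr)


# Hypothetical incident durations: 10, 20, and 60 minutes.
durations = [timedelta(minutes=m) for m in (10, 20, 60)]
mttr = mean_time_to_recovery(durations)
print(mttr)  # 0:30:00

# Hypothetical MTBF of 10 days between failures.
a = availability(timedelta(days=10), mttr)
print(f"{a:.4f}")  # 0.9979
```

Because MTTR sits in the denominator, halving it with MTBF unchanged roughly halves the unavailable fraction.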

Why incidents matter for uptime

Incidents are how an abstract reliability target becomes real cost. The cumulative duration of all incidents over a period is your downtime, and a 99.9% monthly uptime target permits only about 43 minutes of it (0.1% of a 30-day month), so a single 30-minute incident consumes about 70% of the budget at once. Framed against an SLO, that allowance is the error budget (100% minus the SLO), and a string of incidents that exhausts it is the signal to freeze releases and prioritize reliability. After a significant incident, the closing step is a blameless postmortem that records the timeline and the action items so the same failure does not recur. For how each "nine" translates into real minutes, see "What 99.9% uptime actually means", and use status pages to communicate an incident to users while it is in progress.
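The budget arithmetic can be checked with a few lines of code. This is a rough sketch that assumes a 30-day month; the function names are illustrative.

```python
def downtime_budget_minutes(slo_percent, period_days=30):
    """Minutes of downtime allowed by an uptime target over the period."""
    period_minutes = period_days * 24 * 60
    # The error budget is 100% minus the SLO.
    return period_minutes * (100 - slo_percent) / 100


def budget_consumed(incident_minutes, slo_percent, period_days=30):
    """Fraction of the error budget used by the listed incidents."""
    budget = downtime_budget_minutes(slo_percent, period_days)
    return sum(incident_minutes) / budget


print(round(downtime_budget_minutes(99.9), 1))  # 43.2
print(f"{budget_consumed([30], 99.9):.0%}")     # 69%
```

A result above 100% from `budget_consumed` means the period's budget is exhausted.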

See also: Pulsetic status pages

Frequently asked questions

  • What is the difference between an incident and an outage?

    An outage is one kind of incident in which a service is fully unavailable. An incident is the broader term: it covers any unplanned event that disrupts or degrades a service, including partial failures, slow responses that breach a latency threshold, or a broken feature, even when the rest of the service stays up.

  • What are the stages of an incident lifecycle?

    Most teams model an incident as detection, response, mitigation, and resolution. Detection flags the problem, response routes it to an on-call responder, mitigation restores service with a stopgap such as a rollback or failover, and resolution fixes the root cause and verifies that normal operation has returned.

  • How do incidents affect uptime and SLAs?

    The total duration of all incidents over a period is your downtime, which is subtracted from uptime. Because 99.9% monthly uptime allows only about 43 minutes of downtime, a single incident can consume most of the month's error budget (100% minus the SLO) and put an SLA at risk of breach.

  • What happens after an incident is resolved?

    For significant incidents, teams run a blameless postmortem that reconstructs the timeline, identifies the root cause and contributing factors, and assigns follow-up actions to prevent recurrence. Recording the incident also feeds reliability metrics such as MTTR and the rolling history shown on a status page.