What Is Incident Management?

Incident management is the structured process of detecting, triaging, responding to, resolving, and learning from any unplanned event that disrupts a service or degrades its quality. It turns a chaotic incident into a repeatable sequence with defined owners, timers, and follow-up.

The goal is to restore normal service as fast as possible while keeping users and stakeholders informed. ITIL draws a line that matters here: an incident is an unplanned interruption or reduction in service quality, while a problem is the underlying cause of one or more incidents. Incident management restores service now; problem management removes the root cause so the incident does not recur.

The incident lifecycle

Most teams run incidents through five stages, each with a clock attached so performance can be measured later:

  • Detect: an automated check, alert, or user report surfaces the issue. Monitoring that checks every minute can surface an outage 20 to 30 minutes before the first customer email arrives.
  • Triage and assign severity: confirm impact and tag a severity level (for example SEV1 for a full outage, SEV3 for a minor degradation) to set the response pace.
  • Respond and mitigate: page the on-call engineer, assemble the response team, and apply a fix or workaround to stop user impact.
  • Resolve: service is fully restored and the incident is closed.
  • Postmortem and follow-up: run a blameless postmortem within a few business days and track the action items to completion.
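
The five stages above can be modeled as a small state machine that stamps a time on each transition, which is what makes the per-stage clocks measurable afterwards. The following is a minimal sketch in Python, not a reference implementation: the `Incident` class, the `Stage` names, and the `advance` and `elapsed` methods are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import IntEnum
from typing import Dict, Optional


class Stage(IntEnum):
    # Hypothetical names for the five lifecycle stages.
    DETECTED = 1
    TRIAGED = 2
    MITIGATED = 3
    RESOLVED = 4
    POSTMORTEM_DONE = 5


@dataclass
class Incident:
    """Illustrative incident record; field names are assumptions."""
    title: str
    severity: Optional[str] = None  # e.g. "SEV1", set during triage
    timestamps: Dict[Stage, datetime] = field(default_factory=dict)

    def advance(self, stage: Stage, at: datetime) -> None:
        """Record entry into `stage`; stages must be entered in order."""
        expected = len(self.timestamps) + 1
        if stage != expected:
            raise ValueError(f"cannot enter {stage.name}: stage {expected} is next")
        self.timestamps[stage] = at

    def elapsed(self, start: Stage, end: Stage) -> timedelta:
        """Time between two recorded stages."""
        return self.timestamps[end] - self.timestamps[start]


# Usage with made-up times.
t0 = datetime(2024, 1, 1, 12, 0)
inc = Incident("checkout returns 500 errors")
inc.advance(Stage.DETECTED, t0)
inc.advance(Stage.TRIAGED, t0 + timedelta(minutes=4))
inc.severity = "SEV1"
inc.advance(Stage.MITIGATED, t0 + timedelta(minutes=25))
inc.advance(Stage.RESOLVED, t0 + timedelta(minutes=40))
print(inc.elapsed(Stage.DETECTED, Stage.RESOLVED))  # 0:40:00
```

Rejecting out-of-order transitions is a design choice made for this sketch; a real tracker might allow reopening a resolved incident.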

Metrics and roles

Three numbers measure how well the process works. MTTA (mean time to acknowledge) tracks how quickly someone takes ownership of an alert, often targeted under 5 minutes for high-severity pages. MTTR (mean time to recovery) measures end-to-end recovery. MTBF (mean time between failures) tracks reliability over the long run. Availability is approximately MTBF / (MTBF + MTTR), so an MTTR of 30 minutes against an MTBF of 2,000 hours gives 2,000 / 2,000.5, or roughly 99.975% availability.
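
These metrics can be computed from incident timestamps, and the availability figure can be reproduced with the standard steady-state ratio. The sketch below is in Python and rests on assumptions: each incident is stored as a hypothetical (alerted, acknowledged, recovered) triple, the sample data is invented, and the function names are illustrative.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(pairs):
    """Average gap, in minutes, over (start, end) datetime pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Invented sample data: (alerted, acknowledged, recovered) per incident.
t = datetime(2024, 1, 1, 9, 0)
later = t + timedelta(days=30)
incidents = [
    (t, t + timedelta(minutes=3), t + timedelta(minutes=35)),
    (later, later + timedelta(minutes=5), later + timedelta(minutes=25)),
]

# Both clocks start at the alert here; that starting point is an assumption.
mtta = mean_minutes((alerted, acked) for alerted, acked, _ in incidents)
mttr = mean_minutes((alerted, recovered) for alerted, _, recovered in incidents)

print(f"MTTA: {mtta:.1f} min")            # MTTA: 4.0 min
print(f"MTTR: {mttr:.1f} min")            # MTTR: 30.0 min
print(f"{availability(2000, 0.5):.3%}")   # 99.975%
```

Teams differ on where each clock starts (first alert, first user impact, or ticket creation), so the starting point should be fixed before comparing numbers across periods.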

Clear roles prevent confusion during the response. The incident commander (also called incident lead) owns decisions and coordination, while a separate communications lead handles updates so the technical responders stay focused on the fix. During the event, posting timely updates to a public status page sets expectations and cuts inbound support volume; with Pulsetic you can publish those updates from the same place you monitor.

See also: Pulsetic status pages

Frequently asked questions

  • What is the difference between an incident and a problem?

    An incident is an unplanned interruption to a service or a drop in its quality, such as a checkout page returning 500 errors. A problem is the underlying cause of one or more incidents, like a memory leak that triggers crashes every few hours. Incident management restores service quickly, often within minutes, while problem management investigates and eliminates the root cause so the incident does not happen again.

  • What metrics measure incident management performance?

    The core three are MTTA (mean time to acknowledge), MTTR (mean time to recovery), and MTBF (mean time between failures). Many teams target an MTTA under 5 minutes for SEV1 incidents and review MTTR trends monthly. Together these show how fast a team reacts, how fast it recovers, and how often failures occur.

  • Who runs an incident?

    The incident commander, sometimes called the incident lead, owns the response and makes the decisions, even if that person is not the most senior engineer present. A communications lead manages stakeholder and status-page updates separately so responders can focus on the fix. For a SEV1 outage a team might also pull in 3 to 5 subject-matter engineers, but only one commander stays in charge.

  • Why communicate on a status page during an incident?

    A public status page gives customers a single source of truth, which reduces duplicate support tickets and the appearance of silence. Teams that post an update within the first 5 to 10 minutes of a major outage typically see a sharp drop in inbound contacts. It also creates a timestamped record that feeds directly into the postmortem timeline.