What Is a Postmortem?

A postmortem is a structured, blameless retrospective written after an incident to record what happened, how long it lasted, why it happened, and the concrete follow-up actions that will stop it recurring. Its purpose is organizational learning, not assigning blame to individuals.

A postmortem (also called an incident retrospective or post-incident review) is the document a team produces once an outage is resolved and the responders have stood down. It turns a stressful event into durable institutional knowledge. A good postmortem is fact-first and judgment-light: it reconstructs the sequence of events from real evidence (alerts, logs, dashboards, chat transcripts) so that anyone who reads it later understands not just that the service broke, but why the system allowed it to break and what was changed as a result.

Why postmortems are blameless

"Blameless" means the analysis focuses on systems, processes, and missing safeguards rather than on the person who ran the command or shipped the change. The reasoning is practical: in a blameless culture, engineers report what they actually did and saw, which produces an accurate timeline. In a blame culture, people hide details to protect themselves, and the root cause stays buried. The test of a blameless postmortem is whether the people closest to the incident would feel safe writing it themselves.

What a postmortem contains

Most templates converge on the same core sections:

  • A short summary: what broke, who was affected, and the customer impact.
  • A timeline with timestamps, from first symptom through detection, mitigation, and full recovery.
  • Impact metrics, such as the incident's duration and how much of the error budget it consumed.
  • Root cause analysis, often using "5 whys" to move past the surface trigger.
  • Action items, each with an owner and a due date, tracked to completion.
  • What went well, including any safeguard that limited the damage.
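
One way to keep drafts consistent is to check them against the core sections before review. The sketch below is illustrative only: the section keys and the missing_sections helper are hypothetical names chosen for this example, not part of any standard template or tool.

```python
# Hypothetical section keys mirroring the core sections listed above.
REQUIRED_SECTIONS = (
    "summary",
    "timeline",
    "impact",
    "root_cause",
    "action_items",
    "what_went_well",
)


def missing_sections(draft: dict[str, str]) -> list[str]:
    """Return the required sections that are absent or blank in a draft."""
    return [name for name in REQUIRED_SECTIONS if not draft.get(name, "").strip()]


# An incomplete draft: two sections filled in, one left blank, three absent.
draft = {
    "summary": "Checkout API returned 5xx errors for 50 minutes.",
    "timeline": "Minute 0 first errors; minute 4 alert fired; minute 50 rollback done.",
    "impact": "",
}
print(missing_sections(draft))
```

A check like this only confirms that each section exists; whether the root cause analysis is any good still needs a human reviewer.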

The timeline is where reliability math becomes concrete. Suppose a checkout API returns 5xx errors for 50 minutes. If alerting fired at minute 4, an engineer acknowledged at minute 9, and a rollback restored service at minute 50, then detection took 4 minutes, acknowledgment took a further 5, and recovery came 46 minutes after the alert (one input to MTTR). A 99.9% monthly availability target allows about 43.2 minutes of downtime in a 30-day month (0.1% of 43,200 minutes), so if the full 50 minutes counts as downtime, that single event used roughly 116% of the month's error budget. Framing impact against the budget (error budget = 100% minus the SLO) keeps action items honest about urgency.
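
The arithmetic in the checkout example can be reproduced in a few lines. This is a minimal sketch under stated assumptions: a 30-day window, the whole outage counted as full downtime, and times given as minute offsets from the first symptom. The function and key names are hypothetical and chosen for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def incident_metrics(detected_at: float, acknowledged_at: float,
                     recovered_at: float, slo_percent: float) -> dict[str, float]:
    """Derive durations from minute offsets measured from the first symptom."""
    budget = error_budget_minutes(slo_percent)
    return {
        "time_to_detect": detected_at,
        "time_to_acknowledge": acknowledged_at - detected_at,
        "detect_to_recover": recovered_at - detected_at,
        "total_impact": recovered_at,
        "budget_minutes": budget,
        # Assumes every minute of the outage counts fully against the budget.
        "budget_consumed": recovered_at / budget,
    }


m = incident_metrics(detected_at=4, acknowledged_at=9, recovered_at=50, slo_percent=99.9)
print(f"detect: {m['time_to_detect']} min, recover after alert: {m['detect_to_recover']} min")
print(f"budget: {m['budget_minutes']:.1f} min, consumed: {m['budget_consumed']:.0%}")
```

For the example timeline this reports a 43.2-minute budget and about 116% consumption. A partial outage would need the downtime weighted by the fraction of failed requests, which this sketch does not attempt.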

Postmortems close the operational loop that begins with on-call detection and response. The sooner and more reliably an outage is detected, the shorter the timeline there is to explain, which is why teams pair postmortem discipline with continuous external DevOps monitoring and cover the user-facing side with real user monitoring. The deeper lesson of any postmortem is that prevention beats heroics: its value is realized only when its action items actually ship.

See also: Monitoring for DevOps teams

Frequently asked questions

  • What is the difference between a postmortem and an incident?

    An incident is the event itself: a discrete period when a service is degraded or down. A postmortem is the document written afterward that analyzes that incident, capturing the timeline, root cause, and follow-up actions so the same failure does not happen again.

  • Why are postmortems called blameless?

    Blameless means the review focuses on systems and processes rather than on punishing the individual involved. This produces honest, detailed accounts, because engineers will only describe exactly what happened if they trust they will not be penalized for it, and an accurate account is what reveals the true root cause.

  • When should you write a postmortem?

    Common triggers are any customer-facing outage, any incident that breached or threatened an SLA, a near-miss that could easily have caused an outage, or any event that took an unusually long time to detect or recover from. Many teams set a clear threshold, such as any incident with more than 5 minutes of user impact, so the decision is not subjective.

  • What should a postmortem produce?

    Its most important output is a list of action items, each with a named owner and a due date, that remove the conditions that allowed the incident. A postmortem whose action items are never completed has not reduced future risk; the document is only valuable once the fixes it recommends are shipped and verified.
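
Tracking action items to completion can be as simple as a list that is reviewed on a schedule. The sketch below assumes an item counts as done only when it is both shipped and verified, as described above. The ActionItem class, the overdue helper, and the sample owners and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str          # a named person, not a team alias
    due: date
    shipped: bool = False
    verified: bool = False

    @property
    def done(self) -> bool:
        """An item reduces risk only once the fix is shipped and verified."""
        return self.shipped and self.verified


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return items that are past their due date and not yet done."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical items for illustration.
items = [
    ActionItem("Add canary stage to checkout deploys", "alice", date(2024, 1, 10)),
    ActionItem("Alert on 5xx rate above 1%", "bob", date(2024, 1, 10),
               shipped=True, verified=True),
    ActionItem("Document rollback runbook", "carol", date(2024, 2, 1), shipped=True),
]
for item in overdue(items, today=date(2024, 1, 15)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
```

With these sample values only the first item is reported: the second is done, and the third is shipped but unverified and not yet past its due date.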