What Is MTTR (Mean Time to Recovery)?

MTTR (Mean Time to Recovery, sometimes Repair) is the average time it takes to restore a service after a failure, measured from the moment the failure begins to the moment normal operation returns. It is a core reliability metric, and lower is better.

MTTR is calculated by adding up the recovery time of every incident over a period and dividing by the number of incidents. If a service had three outages last month lasting 20, 40 and 30 minutes, its MTTR is (20 + 40 + 30) / 3 = 30 minutes. The figure captures the full repair cycle: detecting the failure, alerting the right people, diagnosing the cause, applying a fix and verifying that service is back.
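The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard API: the function name mttr_minutes and the list of durations are assumptions made for the example.

```python
def mttr_minutes(recovery_minutes):
    """Return the mean recovery time for a list of incident durations (minutes)."""
    if not recovery_minutes:
        # With no incidents there is nothing to average.
        raise ValueError("MTTR is undefined when there are no incidents")
    return sum(recovery_minutes) / len(recovery_minutes)


# Three outages last month lasting 20, 40 and 30 minutes.
last_month = [20, 40, 30]
print(mttr_minutes(last_month))  # 30.0
```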

Because MTTR averages every incident, a single drawn-out outage can dominate the number, while many fast recoveries pull it down. It is most useful when read alongside how often failures happen, measured by MTBF (Mean Time Between Failures), since the two together determine total downtime and therefore uptime.
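To see how much one long outage can move the average, consider the sketch below. The durations are invented for illustration and do not come from a real service.

```python
def mttr(durations):
    """Mean recovery time, in the same unit as the input durations."""
    return sum(durations) / len(durations)


# Hypothetical data: nine quick 5-minute recoveries...
typical = [5] * 9
# ...plus one drawn-out 4-hour (240-minute) outage.
with_outlier = typical + [240]

print(mttr(typical))       # 5.0
print(mttr(with_outlier))  # 28.5

# Total downtime is MTTR multiplied by the number of incidents.
print(mttr(with_outlier) * len(with_outlier))  # 285.0
```

One incident out of ten raises the average from 5 to 28.5 minutes, which is why it helps to look at the individual incidents behind the figure as well.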

How MTTR affects availability

Availability is a direct function of how often a system breaks and how quickly it recovers, expressed as Availability = MTBF / (MTBF + MTTR), with both values in the same unit. A service that runs 200 hours between failures and takes 1 hour to recover is available 200 / 201 of the time, or about 99.5%. Halve the MTTR to 30 minutes (0.5 hours) and availability climbs to 200 / 200.5, roughly 99.75%. Improving recovery speed lifts the same headline percentage that uptime targets and SLAs are written against, and at high tiers the margins are tight: 99.9% allows only about 43 minutes of downtime in a 30-day month. For more on what those nines mean, see "What 99.9% uptime actually means".
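The arithmetic in this section can be reproduced with a short script. The function names are assumptions made for this sketch, and the downtime budget assumes a 30-day month.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), both in the same unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def monthly_downtime_budget_minutes(target, days=30):
    """Minutes of downtime a given availability target allows per month."""
    return (1 - target) * days * 24 * 60


print(f"{availability(200, 1):.4%}")    # 99.5025%
print(f"{availability(200, 0.5):.4%}")  # 99.7506%
print(round(monthly_downtime_budget_minutes(0.999), 1))  # 43.2
```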

How to lower MTTR

MTTR breaks into stages, and the largest savings usually come from the earliest one. You cannot begin fixing an outage until you know it exists, so shrinking detection time often moves the number more than faster fixes do. Common levers include:

  • Detect faster with frequent checks from outside your own infrastructure, so failures surface the way a visitor would experience them.
  • Alert the right person immediately through on-call routing, instead of waiting for a customer to report the problem.
  • Shorten diagnosis with clear runbooks, good logging and dashboards that point to the likely cause.
  • Speed up recovery with rollbacks, failover and rehearsed incident procedures so the fix itself is routine.
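To find out which of these levers matters most for a given incident, one option is to record a timestamp at each stage and compare the gaps. The sketch below assumes a hypothetical incident record with the keys started, detected, acknowledged, diagnosed and resolved. Real incident tools name and store these fields differently, and the times are made up.

```python
from datetime import datetime

# Each stage is the gap between two timestamps on the incident record.
# The key names are assumptions for this example, not a real tool's schema.
STAGES = [
    ("detect", "started", "detected"),
    ("alert", "detected", "acknowledged"),
    ("diagnose", "acknowledged", "diagnosed"),
    ("fix and verify", "diagnosed", "resolved"),
]


def stage_minutes(incident):
    """Return the minutes spent in each stage of a single incident."""
    breakdown = {}
    for name, start_key, end_key in STAGES:
        delta = incident[end_key] - incident[start_key]
        breakdown[name] = delta.total_seconds() / 60
    return breakdown


# Hypothetical incident timeline.
incident = {
    "started": datetime(2024, 1, 10, 9, 0),
    "detected": datetime(2024, 1, 10, 9, 12),
    "acknowledged": datetime(2024, 1, 10, 9, 15),
    "diagnosed": datetime(2024, 1, 10, 9, 25),
    "resolved": datetime(2024, 1, 10, 9, 35),
}

breakdown = stage_minutes(incident)
slowest = max(breakdown, key=breakdown.get)
print(breakdown)
print(slowest, sum(breakdown.values()))  # detect 35.0
```

In this made-up timeline, detection takes 12 of the 35 minutes, so faster checks would save more time than a faster fix.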

Fast external alerting is the foundation for all of this, which is why teams pair incident response with continuous monitoring. See how Pulsetic supports this in monitoring for DevOps teams.


Frequently asked questions

  • What is the difference between MTTR and MTBF?

    MTTR measures how long it takes to recover once a failure has occurred, while MTBF measures how long a system runs between failures. MTTR is about recovery speed, MTBF is about failure frequency, and together they determine overall availability.

  • Does MTTR stand for recovery or repair?

    The acronym is used for both Mean Time to Recovery and Mean Time to Repair, and in everyday monitoring they are often treated as the same number. Strictly, recovery covers the whole incident, from the start of the failure through detection to restored service, while repair refers only to the hands-on fixing time, so it helps to state which one you mean.

  • What is a good MTTR?

    There is no universal target, because it depends on the service and its uptime commitment. The useful approach is to track MTTR over time and drive it down, since a lower MTTR for the same number of failures always means less total downtime.

  • How do you reduce MTTR the fastest?

    Usually by cutting detection time. You cannot start fixing an outage until you know about it, so monitoring that checks from outside your infrastructure and alerts immediately tends to shorten MTTR more than speeding up the repair itself.