Uptime & Monitoring Glossary: Key Terms Explained

Alerting

Alerting is the automatic process of notifying the right people the moment a monitored check fails or a metric crosses a defined threshold, so problems are caught and routed to an on-call responder before users start reporting them.

API Monitoring

API monitoring is the continuous, automated checking of REST and GraphQL endpoints for availability, correctness, and speed.

Availability

Availability is the share of time a system is operational and usable, expressed as a percentage over a defined window.
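
As a rough illustration of the percentage described above, the sketch below divides uptime by the length of the window. The function name and the 30-day example window are assumptions made for illustration, not part of any particular tool.

```python
def availability_percent(window_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a percentage of the measurement window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= downtime_seconds <= window_seconds:
        raise ValueError("downtime_seconds must lie within the window")
    uptime_seconds = window_seconds - downtime_seconds
    return 100.0 * uptime_seconds / window_seconds


# Example: 45 minutes of downtime in a 30-day window (assumed figures).
window = 30 * 24 * 3600
print(round(availability_percent(window, 45 * 60), 3))  # 99.896
```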

CLS (Cumulative Layout Shift)

CLS (Cumulative Layout Shift) is the Core Web Vital that measures visual stability: how much a page's content unexpectedly moves while it loads.

Core Web Vitals

Core Web Vitals are Google's three field metrics for real-world page experience: LCP (loading), INP (responsiveness), and CLS (visual stability).
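
To make the three metrics concrete, the sketch below rates a measured value against the thresholds Google publishes for each metric, which are usually assessed at the 75th percentile of page loads. The threshold values are not stated in this glossary, and the dictionary and function names are hypothetical.

```python
# (good_upper_bound, needs_improvement_upper_bound) per metric.
# LCP and INP are in milliseconds; CLS is a unitless score.
THRESHOLDS = {
    "LCP": (2500.0, 4000.0),
    "INP": (200.0, 500.0),
    "CLS": (0.1, 0.25),
}


def rate(metric: str, value: float) -> str:
    """Classify a metric value as good, needs improvement, or poor."""
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"


print(rate("LCP", 2400))  # good
print(rate("INP", 350))   # needs improvement
print(rate("CLS", 0.3))   # poor
```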

DNS Monitoring

DNS monitoring is the continuous, external checking that a domain's DNS records resolve correctly, return the expected values, and respond quickly. It catches resolution failures, slow lookups, and unexpected record changes before they take an otherwise healthy site offline.

Domain Expiration Monitoring

Domain expiration monitoring is the practice of tracking a domain name's registration expiry date through WHOIS or RDAP so the domain is renewed before it lapses.
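
Once the expiry date has been obtained, the remaining work is simple date arithmetic. The sketch below assumes the date has already been looked up elsewhere (no WHOIS or RDAP query is performed here); the 30-day warning window and function names are illustrative choices.

```python
from datetime import date


def days_until_expiry(expiry: date, today: date) -> int:
    """Number of whole days left before the registration lapses."""
    return (expiry - today).days


def needs_renewal_alert(expiry: date, today: date, warn_days: int = 30) -> bool:
    """True once the domain is inside the warning window (or already expired)."""
    return days_until_expiry(expiry, today) <= warn_days


today = date(2025, 1, 1)
print(days_until_expiry(date(2025, 1, 21), today))    # 20
print(needs_renewal_alert(date(2025, 1, 21), today))  # True
print(needs_renewal_alert(date(2025, 6, 1), today))   # False
```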

Downtime

Downtime is any period during which a website or service is unavailable or not working correctly for its users.

Error Budget

An error budget is the amount of unreliability a Service Level Objective (SLO) permits, equal to 100% minus the SLO target.
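
The "100% minus the SLO target" arithmetic can be sketched as follows for a request-based SLO. The request counts and function names are made-up examples.

```python
def error_budget_requests(slo_percent: float, total_requests: int) -> float:
    """Number of requests that may fail without breaching the SLO."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * total_requests


def budget_remaining(slo_percent: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_requests(slo_percent, total_requests)
    return (budget - failed) / budget


# A 99.9% SLO over one million requests leaves a 0.1% budget.
print(round(error_budget_requests(99.9, 1_000_000), 1))  # 1000.0
print(round(budget_remaining(99.9, 1_000_000, 250), 2))  # 0.75
```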

Escalation Policy

An escalation policy is a predefined set of rules that decides who is notified about an alert and how it advances to the next person or tier if no one acknowledges it within a set time.
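
One possible way to model such rules is an ordered list of steps, each with a wait time before the alert moves on. The class, field names, and the three-tier policy below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str        # who is notified at this step
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def targets_notified(policy: list[EscalationStep], minutes_unacknowledged: int) -> list[str]:
    """Return everyone notified so far for an alert that is still unacknowledged."""
    notified = []
    threshold = 0  # minutes after which the current step fires
    for step in policy:
        if minutes_unacknowledged < threshold:
            break
        notified.append(step.target)
        threshold += step.wait_minutes
    return notified


policy = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 0),
]
print(targets_notified(policy, 0))   # ['primary on-call']
print(targets_notified(policy, 7))   # ['primary on-call', 'secondary on-call']
print(targets_notified(policy, 15))  # all three targets
```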

Health Check

A health check is an automated probe of an endpoint that reports whether a service is up and working correctly.
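
A minimal probe might look like the sketch below, which uses only the Python standard library. The rule that a healthy response is a 2xx status containing the text "ok" is an assumption for illustration; real endpoints define their own contract.

```python
import urllib.request


def is_healthy(status: int, body: str, expected_text: str = "ok") -> bool:
    """Assumed rule: a 2xx status whose body contains the expected text."""
    return 200 <= status < 300 and expected_text in body


def probe(url: str, timeout: float = 5.0, expected_text: str = "ok") -> bool:
    """Fetch the endpoint and report whether it looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, expected_text)
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError,
        # so connection failures and non-2xx responses count as unhealthy.
        return False


print(is_healthy(200, '{"status": "ok"}'))  # True
print(is_healthy(503, "ok"))                # False
```

Calling `probe("https://example.com/health")` would run the same rule against a live endpoint; that URL is a placeholder.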

Heartbeat Monitoring

Heartbeat monitoring watches for a regular signal (a "heartbeat") sent by a job or service and raises an alert when that signal does not arrive on time.
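
The "does not arrive on time" test reduces to comparing the time since the last heartbeat with the expected interval. The sketch below adds an optional grace period; the parameter names and the example schedule are assumptions.

```python
from datetime import datetime, timedelta


def heartbeat_missed(
    last_seen: datetime,
    now: datetime,
    expected_interval: timedelta,
    grace: timedelta = timedelta(0),
) -> bool:
    """True when the next heartbeat is overdue beyond the grace period."""
    return now - last_seen > expected_interval + grace


last = datetime(2025, 1, 1, 2, 0)  # a nightly job last checked in at 02:00
interval = timedelta(hours=24)
grace = timedelta(minutes=10)

print(heartbeat_missed(last, datetime(2025, 1, 2, 2, 5), interval, grace))   # False
print(heartbeat_missed(last, datetime(2025, 1, 2, 2, 30), interval, grace))  # True
```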

Incident

An incident is an unplanned event that disrupts or degrades a service below its expected level, from a full outage to slow or partial failures.

Incident Management

Incident management is the structured process of detecting, triaging, responding to, resolving, and learning from any unplanned event that disrupts a service or degrades its quality.

Incident Severity

Incident severity is a classification scheme that ranks an incident by its business impact, so teams know how urgently to respond, who to wake up, and how widely to communicate.

INP (Interaction to Next Paint)

INP (Interaction to Next Paint) is the Core Web Vital that measures how quickly a page responds to user interactions, recording the delay between an action such as a tap, click, or key press and the next frame the browser paints in response.

Latency

Latency is the delay between a request being sent and the response beginning to arrive, usually measured in milliseconds.

LCP (Largest Contentful Paint)

LCP (Largest Contentful Paint) is the Core Web Vital that measures loading performance: the time from when a page starts loading until its largest visible element (usually a hero image, video, or heading block) is rendered.

MTBF (Mean Time Between Failures)

MTBF (Mean Time Between Failures) is the average amount of time a system operates normally between one failure and the next.
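
A common way to compute this average is to divide total operating time by the number of failures in the period. The figures and function name below are invented for illustration.

```python
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    """Average operating time per failure over the observed period."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return total_uptime_hours / failure_count


# A 720-hour month with 3 failures that caused 6 hours of downtime in total.
uptime = 720 - 6
print(mtbf_hours(uptime, 3))  # 238.0
```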

MTTA (Mean Time to Acknowledge)

MTTA (Mean Time to Acknowledge) is the average time between an alert firing and a responder confirming they have seen it and are taking ownership.
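
Given the firing and acknowledgement timestamps for each alert, the average can be sketched as below. The tuple layout and sample timestamps are assumptions.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(alerts: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (acknowledged_at - fired_at) across all alerts."""
    if not alerts:
        raise ValueError("need at least one acknowledged alert")
    total = sum((acked - fired for fired, acked in alerts), timedelta(0))
    return total / len(alerts)


alerts = [
    (datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 9, 2)),    # acknowledged in 2 min
    (datetime(2025, 1, 1, 14, 0), datetime(2025, 1, 1, 14, 6)),  # acknowledged in 6 min
]
print(mean_time_to_acknowledge(alerts))  # 0:04:00
```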

MTTR (Mean Time to Recovery)

MTTR (Mean Time to Recovery, sometimes Repair) is the average time it takes to restore a service after a failure, measured from the moment the failure begins to the moment normal operation returns.
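
If each outage is recorded as a duration from failure start to restored service, MTTR is the mean of those durations. The sample values below are invented.

```python
from statistics import fmean


def mttr_minutes(outage_durations_minutes: list[float]) -> float:
    """Mean time from failure start to restored service, in minutes."""
    if not outage_durations_minutes:
        raise ValueError("need at least one outage")
    return fmean(outage_durations_minutes)


print(mttr_minutes([30, 90, 60]))  # 60.0
```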

On-Call

On-call is the practice of designating engineers, on a rotating schedule, to respond to alerts and incidents at any hour, including nights and weekends.

Page Load Time

Page load time is the total elapsed time for a web page to download, render, and become usable, measured from navigation start to the point the page is fully painted and interactive.

Ping Monitoring

Ping monitoring sends ICMP echo requests to a host and waits for echo replies to confirm reachability and measure response speed.

Postmortem

A postmortem is a structured, blameless retrospective written after an incident to record what happened, how long it lasted, why it happened, and the concrete follow-up actions that will stop it recurring.

Real User Monitoring (RUM)

Real user monitoring (RUM) measures the experience of actual visitors by collecting performance data from their real browsing sessions.

Response Time

Response time is the elapsed interval from the moment a client sends a request until it receives the response, usually measured for a single request.

SLA (Service Level Agreement)

A Service Level Agreement (SLA) is a formal, usually contractual commitment between a provider and its customers that defines the expected level of service, such as 99.9% uptime, and the remedies (like service credits) that apply if that level is not met.
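
To show what a figure like 99.9% uptime means in practice, the sketch below converts the target into allowed downtime for a billing window and checks measured downtime against it. The 30-day window and function names are assumptions; actual SLAs define their own measurement period and exclusions.

```python
def allowed_downtime_minutes(uptime_percent: float, window_days: int = 30) -> float:
    """Downtime permitted in the window before the uptime target is missed."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - uptime_percent) / 100.0


def sla_met(uptime_percent: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """True when measured downtime stays within the allowance."""
    return downtime_minutes <= allowed_downtime_minutes(uptime_percent, window_days)


print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
print(sla_met(99.9, 30))  # True
print(sla_met(99.9, 60))  # False
```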
