Mixpanel Status · History · Incident #3717

RESOLVED

Query API degraded performance

Minor · Started May 14, 2026 · 4:17 PM

Duration

1d 3h 33m
Severity

Minor

Summary

Query API degraded performance

# Mixpanel RCA: Transient Data Access Issue, May 14, 2026

## Summary

On Thursday, May 14, 2026 at approximately 2:30 PM PT, a routine but infrequent cleanup operation in Mixpanel's storage system mistakenly removed a portion of production data files in addition to the unused files it was intended to remove. Some customers experienced query errors during the hours that followed. We detected the issue within minutes, deployed mitigations the same evening that returned query success rates and latency to normal, and restored the affected files from backup by 5:15 PM PT on Friday, May 15. Mixpanel's ingestion pipeline was not affected and no event data was lost in transit.

## What happened

This incident was triggered by a storage cleanup procedure that runs periodically to remove files no longer referenced by Mixpanel's metadata. The procedure was more involved than usual: it followed a recent enhancement to our file storage strategy that left a set of unused files behind in our storage backend, and addressing them required extending our standard cleanup approach to cover a new code path.

As part of executing this extended cleanup, an engineer generated the list of files to delete using a SQL query whose date filter was not strictly earlier than the reference snapshot it was being compared against. As a result, a small set of legitimate production files that had been written in the gap window between the snapshot and the filter date were incorrectly classified as unused and removed. The deletion ran for roughly half an hour before internal alerting caught the resulting query failures and the operation was stopped.

The trigger was operator error against an ambiguous runbook, not a defect in the live serving path or in our ingestion pipeline.

## Customer impact

Impact unfolded in two phases.

The first phase ran from Thursday at approximately 2:30 PM PT until 8:11 PM PT, roughly five and a half hours. During this window, customers across the platform may have seen slower or failed queries when their requests touched files that had been deleted. The breadth and severity varied by project depending on which data each query touched. By 8:11 PM PT, mitigations had fully rolled out: queries automatically retried against an alternate availability zone, and a fallback path was put in place to serve missing files from a backup datastore. After this point, query success rate and latency returned to normal.

The second phase lasted from 8:11 PM PT Thursday through approximately 5:15 PM PT Friday, May 15. During this window, fewer than 2% of customers were still affected, specifically those whose deleted files had not yet been fully restored from backup. The vast majority of these files were recovered by Friday afternoon. A small number of projects (under 30) had files that could not be fully recovered from backup, and we are following up with those accounts directly.

## Timeline (Pacific Time)

* May 14, 2:30 PM: Cleanup operation begins
* May 14, 3:11 PM: Internal alerting flags query failures; the cleanup operation is stopped within minutes
* May 14, 4:07 PM: Status page banner posted
* May 14, 4:45 PM: Mitigation deployed: queries automatically retry against an alternate availability zone
* May 14, 7:12 PM: Mitigation deployed: queries fall back to a backup datastore for missing files
* May 14, 8:11 PM: Query success rate and latency fully restored to normal levels
* May 15, 5:15 PM: File restore from backup complete; status page banner resolved

## Why this happened

Several contributing factors lined up.

* The runbook for this cleanup procedure had ambiguous wording around the ordering and timing of its inputs. It had been recently authored to handle the new file-storage code path and had not gone through a formal review before being used.
* Our cleanup tooling did not programmatically enforce the safety invariant that the date filter must be strictly before the reference snapshot. That invariant lived only in operator-authored SQL.
* The extended cleanup was being executed in parallel across two storage layers by two different engineers, which increased the room for error.

## What we're doing to prevent recurrence

We have already made or have actively in flight the following changes.

* We are adding programmatic safeguards to our cleanup tooling so that an input set whose date filter is not safely before the reference snapshot is rejected before any deletion occurs, along with a reconciliation step that flags any production-referenced file before deletion proceeds.
* Destructive cleanup operations will now run in phased stages, starting with internal projects and pausing for a holding period before any broader execution.
* Destructive storage operations now require a second engineer to sign off on the exact deletion set and to be present during execution, matching the practice we already follow for database migrations.
* We have updated the cleanup runbook with explicit guidance on input timing, required safety buffers, and an enforced review process for any runbook covering a destructive operation.
* Longer term, we are working to eliminate the manual portion of this cleanup procedure entirely and route it through our existing automated cleanup infrastructure, so the class of failure that produced this incident is no longer reachable through human input.

## Closing

Reliability and data integrity are foundational to the trust our customers place in Mixpanel, and we recognize the impact this incident had on the teams who rely on us. We are sorry for the disruption. If you have questions about how this incident may have affected a specific project, please reach out to your account team or Mixpanel Support.
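
Editorial illustration: the RCA above describes a safeguard that rejects a deletion set when its date filter is not safely before the reference snapshot, plus a reconciliation step that flags production-referenced files. The Python sketch below shows one way such a check could be written. It is not Mixpanel's tooling. The function name `validate_deletion_set`, the exception `UnsafeDeletionSet`, the argument shapes, and the 24-hour default buffer are all assumptions made for illustration; the RCA does not state a buffer value.

```python
from datetime import datetime, timedelta


class UnsafeDeletionSet(Exception):
    """Raised when a proposed deletion set fails a safety check."""


def validate_deletion_set(
    candidates,        # hypothetical: iterable of file paths proposed for deletion
    filter_cutoff,     # upper bound used by the date filter that built `candidates`
    snapshot_time,     # time at which the reference snapshot was taken
    referenced_files,  # hypothetical: set of paths still referenced by production metadata
    safety_buffer=timedelta(hours=24),  # assumed value, not taken from the RCA
):
    """Return the candidate list only if every safety check passes."""
    # Check 1: the filter cutoff must be strictly earlier than the snapshot,
    # with an extra buffer. With a zero buffer this reduces to cutoff < snapshot.
    if filter_cutoff >= snapshot_time - safety_buffer:
        raise UnsafeDeletionSet(
            f"filter cutoff {filter_cutoff} is not safely before "
            f"snapshot {snapshot_time} (buffer {safety_buffer})"
        )

    # Check 2: reconciliation. No candidate may still be referenced by production.
    candidate_list = list(candidates)
    still_referenced = sorted(set(candidate_list) & set(referenced_files))
    if still_referenced:
        raise UnsafeDeletionSet(
            f"{len(still_referenced)} candidate file(s) are still referenced, "
            f"for example {still_referenced[0]}"
        )

    return candidate_list
```

In this sketch both checks run before anything is deleted, so a bad input set fails closed. Nothing here deletes files; a caller would pass the returned list to a separate, reviewed deletion step.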

Started

May 14, 2026 · 4:17 PM
Resolved

May 15, 2026 · 7:50 PM

Event timeline

How this incident unfolded

◐

Investigating
May 14 · 4:17 PM Mixpanel

We are experiencing degraded performance with our Query API, including increased latency. You may see slow-loading reports, incomplete query results, query errors, or data discrepancies. We appreciate your patience while our engineers work to restore normal functionality. We will post progress updates on our status page. If you have any questions, please contact support.
◆

Identified
May 14 · 4:48 PM Mixpanel

The issue has been identified and a fix is being implemented.
◆

Identified
May 14 · 5:18 PM Mixpanel

We are continuing to work on a fix for this issue.
◆

Identified
May 14 · 5:49 PM Mixpanel

We are continuing to work on a fix for this issue.
◆

Identified
May 14 · 6:49 PM Mixpanel

We are continuing to work on a fix for this issue.
◆

Identified
May 14 · 7:49 PM Mixpanel

We are continuing to work on a fix for this issue.
◆

Identified
May 14 · 8:49 PM Mixpanel

We are continuing to work on a fix for this issue.
◆

Identified
May 15 · 1:18 AM Mixpanel

We are continuing to work on a fix for this issue.
◆

Identified
May 15 · 3:19 AM Mixpanel

We are continuing to work on a fix for this issue.
◆

Identified
May 15 · 5:28 AM Mixpanel

We've continued to make progress on the issue affecting some US projects. Query success rates have returned to normal levels, and the related processing delays have also recovered. A subset of affected projects may still see incomplete data in query results while our recovery process runs. We're actively working on this and will share another update in 2 hours. Impact remains limited to our US region. We do not currently believe any data has been permanently lost.
◆

Identified
May 15 · 7:31 AM Mixpanel

Recovery is progressing well. Query success rates remain at normal levels, and missing data has now been restored for a portion of affected projects. We're continuing the recovery process for the remaining affected projects and currently estimate full recovery within approximately 3–5 hours. Some customers may still see incomplete data in query results until this work is complete.
◆

Identified
May 15 · 9:37 AM Mixpanel

We are continuing to work on resolving this issue. Our current estimate for full recovery is approximately 3–5 hours. Some customers may still see incomplete data in reports until recovery is complete. We will continue to provide updates.
◆

Identified
May 15 · 11:47 AM Mixpanel

Recovery is continuing to progress and query latency has returned to normal levels. Our current estimate for full recovery is approximately 3–4 hours. Some customers may still see incomplete data in reports until recovery is complete. We will continue to provide updates.
◆

Identified
May 15 · 12:56 PM Mixpanel

We are investigating an increase in query latency that appears to be unrelated to the ongoing recovery. You may experience slower-loading reports. Our team is actively looking into the cause and we will provide an update shortly. Recovery is continuing to progress. Our current estimate for full recovery is approximately 3–4 hours. Some customers may still see incomplete data in reports until recovery is complete. We will continue to provide updates.
◆

Identified
May 15 · 2:23 PM Mixpanel

Query latency has returned to normal levels. We are continuing to monitor.
◆

Identified
May 15 · 3:14 PM Mixpanel

The query latency issue has been resolved and latency has returned to normal levels. Recovery is continuing to progress, and our current estimate for full recovery is approximately 4–5 hours. Some customers may still see incomplete data in reports until recovery is complete. We will continue to provide updates.
◆

Identified
May 15 · 5:17 PM Mixpanel

Recovery is continuing to progress. Our current estimate for full recovery is approximately 1–2 hours. Some customers may still see incomplete data in reports until recovery is complete. We will continue to provide updates.
◉

Monitoring
May 15 · 6:52 PM Mixpanel

A fix has been implemented and we are monitoring the results.
✓

Resolved
May 15 · 7:50 PM Mixpanel

This incident has been resolved.
✓

Postmortem
May 21 · 2:43 PM Mixpanel

