Post-Mortem: US Region Log Ingestion Delay
Date: 2026-03-17
Duration: ~6h 15m (12:00 - 18:15 UTC)
Severity: Medium
Region: AWS us-west-2
Data loss: None
Summary
A gradual log ingestion delay built up in us-west-2 starting at 12:00 UTC, eventually reaching 15 minutes of latency before detection at approximately 15:50 UTC. The delay was caused by insufficient processing capacity. Detection was significantly delayed because the primary alerting metric had been renamed earlier that morning, silently breaking our lag alerts.
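Why the rename broke alerting silently: a query keyed to a metric name that no longer exists typically returns no samples rather than an error, so a threshold alert simply never evaluates to true. Below is a minimal Python sketch of that failure mode; the metric names, the in-memory "backend", and the 5-minute alert threshold are illustrative assumptions, not our actual alerting stack.

    # Minimal sketch of the failure mode; hypothetical names throughout.
    # The metric names, the in-memory "backend", and the 5-minute alert
    # threshold are illustrative assumptions, not our actual stack.

    # After the PR, the backend only exposes the renamed metric.
    BACKEND = {"log_ingestion_lag_seconds": [120.0, 480.0, 900.0]}

    def query_metric(name: str) -> list[float]:
        # A query for a missing metric yields no samples rather than an
        # error, which is what makes the breakage silent.
        return BACKEND.get(name, [])

    def lag_alert_should_fire(threshold_seconds: float = 300.0) -> bool:
        # The alert is still keyed to the *old* metric name.
        samples = query_metric("ingestion_lag_seconds")
        return bool(samples) and max(samples) > threshold_seconds

    print(lag_alert_should_fire())  # False: no samples, no page, despite 900s of lag

The key property is that the alert fails "closed" from the backend's point of view but "open" from ours: an empty query result is indistinguishable from a healthy pipeline.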
Timeline (UTC)
~12:00 - Ingestion latency begins accumulating for customers in AWS us-west-2
Morning (same day) - A PR is merged that renames the metric backing our primary lag alerts
~15:50 - Secondary synthetic alerts fire once log availability lag exceeds the 15-minute threshold; the team is paged
~15:50+ - Additional processing nodes are deployed to drain the backlog
18:15 - All data fully backfilled, ingestion latency returns to normal
Root Cause
Processing capacity in AWS us-west-2 was insufficient to keep up with incoming log volume, causing a slowly growing ingestion delay. This is a condition our primary lag alerts are designed to catch early. However, a metric rename merged that same morning broke those alerts without anyone noticing, leaving only our secondary synthetic checks (which have a 15-minute detection threshold) as a safety net.
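The secondary safety net survived the rename because it measures end-to-end availability directly rather than watching a named metric. The following is a hedged sketch of such a synthetic check; the pipeline stand-ins, poll interval, and the exact placement of the 15-minute threshold are assumptions about our setup, not the real implementation.

    # Hedged sketch of a synthetic end-to-end availability check; the
    # pipeline stand-ins, poll interval, and threshold placement are
    # assumptions about our setup, not the real implementation.
    import time
    import uuid

    THRESHOLD_SECONDS = 15 * 60   # secondary alerts page only past 15 minutes
    POLL_INTERVAL_SECONDS = 30

    _STORE: set[str] = set()  # in-memory stand-in for the log store

    def emit_synthetic_log(marker: str) -> None:
        _STORE.add(marker)  # real version: POST a tagged line to ingestion

    def marker_is_searchable(marker: str) -> bool:
        return marker in _STORE  # real version: query the log store

    def page_oncall(message: str) -> None:
        print(f"PAGE: {message}")  # real version: trigger the pager

    def check_end_to_end_latency() -> None:
        # Tag a log line uniquely, then measure how long until it is queryable.
        marker = f"synthetic-{uuid.uuid4()}"
        deadline = time.monotonic() + THRESHOLD_SECONDS
        emit_synthetic_log(marker)
        while not marker_is_searchable(marker):
            if time.monotonic() > deadline:
                page_oncall(f"synthetic log not searchable after {THRESHOLD_SECONDS}s")
                return
            time.sleep(POLL_INTERVAL_SECONDS)

    check_end_to_end_latency()  # returns immediately with the in-memory stubs

Because the check exercises the pipeline itself rather than referencing a metric by name, a rename cannot silently disable it; the trade-off is the coarse 15-minute detection threshold, which is why paging was delayed until ~15:50.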
Impact
Log data in AWS us-west-2 was delayed by up to 15 minutes
No data was lost; all logs were eventually processed and available