Post-Mortem: US Region Log Ingestion Delay
Date: 2026-03-17
Duration: ~6h 15m (12:00 - 18:15 UTC)
Severity: Medium
Region: AWS us-west-2
Data loss: None
Summary
A gradual log ingestion delay built up in us-west-2 starting at 12:00 UTC, eventually reaching 15 minutes of latency before detection at approximately 15:50 UTC. The delay was caused by insufficient processing capacity. Detection was significantly delayed because the primary alerting metric had been renamed earlier that morning, silently breaking our lag alerts.
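Why the rename broke alerting silently: a query keyed to a metric name that no longer exists typically returns no samples rather than an error, so a threshold alert simply never evaluates to true. Below is a minimal Python sketch of that failure mode; the metric names, the in-memory "backend", and the 5-minute alert threshold are illustrative assumptions, not our actual alerting stack.

    # Minimal sketch of the failure mode; hypothetical names throughout.
    # The metric names, the in-memory "backend", and the 5-minute alert
    # threshold are illustrative assumptions, not our actual stack.

    # After the PR, the backend only exposes the renamed metric.
    BACKEND = {"log_ingestion_lag_seconds": [120.0, 480.0, 900.0]}

    def query_metric(name: str) -> list[float]:
        # A query for a missing metric yields no samples rather than an
        # error, which is what makes the breakage silent.
        return BACKEND.get(name, [])

    def lag_alert_should_fire(threshold_seconds: float = 300.0) -> bool:
        # The alert is still keyed to the *old* metric name.
        samples = query_metric("ingestion_lag_seconds")
        return bool(samples) and max(samples) > threshold_seconds

    print(lag_alert_should_fire())  # False: no samples, no page, despite 900s of lag

The key property is that the alert fails "closed" from the backend's point of view but "open" from ours: an empty query result is indistinguishable from a healthy pipeline.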
Timeline (UTC)
~12:00 - Ingestion latency begins accumulating for customers in AWS us-west-2
Morning (same day) - A PR is merged that renames the metric backing our primary lag alerts
~15:50 - Secondary synthetic alerts fire once log availability lag exceeds the 15-minute threshold; the team is paged
~15:50+ - Additional processing nodes are deployed to drain the backlog
18:15 - All data fully backfilled, ingestion latency returns to normal
Root Cause
Processing capacity in AWS us-west-2 was insufficient to keep up with incoming log volume, causing a slowly growing ingestion delay. This is a condition our primary lag alerts are designed to catch early. However, a metric rename merged that same morning broke those alerts without anyone noticing, leaving only our secondary synthetic checks (which have a 15-minute detection threshold) as a safety net.
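The secondary safety net survived the rename because it measures end-to-end availability directly rather than watching a named metric. The following is a hedged sketch of such a synthetic check; the pipeline stand-ins, poll interval, and the exact placement of the 15-minute threshold are assumptions about our setup, not the real implementation.

    # Hedged sketch of a synthetic end-to-end availability check; the
    # pipeline stand-ins, poll interval, and threshold placement are
    # assumptions about our setup, not the real implementation.
    import time
    import uuid

    THRESHOLD_SECONDS = 15 * 60   # secondary alerts page only past 15 minutes
    POLL_INTERVAL_SECONDS = 30

    _STORE: set[str] = set()  # in-memory stand-in for the log store

    def emit_synthetic_log(marker: str) -> None:
        _STORE.add(marker)  # real version: POST a tagged line to ingestion

    def marker_is_searchable(marker: str) -> bool:
        return marker in _STORE  # real version: query the log store

    def page_oncall(message: str) -> None:
        print(f"PAGE: {message}")  # real version: trigger the pager

    def check_end_to_end_latency() -> None:
        # Tag a log line uniquely, then measure how long until it is queryable.
        marker = f"synthetic-{uuid.uuid4()}"
        deadline = time.monotonic() + THRESHOLD_SECONDS
        emit_synthetic_log(marker)
        while not marker_is_searchable(marker):
            if time.monotonic() > deadline:
                page_oncall(f"synthetic log not searchable after {THRESHOLD_SECONDS}s")
                return
            time.sleep(POLL_INTERVAL_SECONDS)

    check_end_to_end_latency()  # returns immediately with the in-memory stubs

Because the check exercises the pipeline itself rather than referencing a metric by name, a rename cannot silently disable it; the trade-off is the coarse 15-minute detection threshold, which is why paging was delayed until ~15:50.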
Impact
Log data in AWS us-west-2 was delayed by up to 15 minutes
No data was lost; all logs were eventually processed and available