Post-Mortem: US Region Log Ingestion Delay

  • Date: 2026-03-17

  • Duration: ~6h 15m (12:00 - 18:15 UTC)

  • Severity: Medium

  • Region: AWS us-west-2

  • Data loss: None

Summary

A gradual log ingestion delay built up in us-west-2 starting at 12:00 UTC, eventually reaching 15 minutes of latency before detection at approximately 15:50 UTC. The delay was caused by insufficient processing capacity. Detection was significantly delayed because the primary alerting metric had been renamed earlier that morning, silently breaking our lag alerts.
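
The failure mode is worth spelling out: an alert rule that queries a time series which no longer exists typically evaluates to "no data" rather than raising an error, so the rename switched the alert off without any visible signal. The sketch below is a minimal simulation of that behavior with hypothetical metric names and thresholds; the write-up does not identify the actual metric or alerting stack.

```python
# Minimal simulation of how a metric rename silently disables a
# threshold alert. All names and numbers are hypothetical; the
# post-mortem does not specify the real metric or alerting stack.

# In-memory "metrics store": metric name -> latest value in seconds.
# A PR renamed the lag metric, so only the new name reports data.
metrics = {
    "log_ingestion_lag_seconds_v2": 900.0,  # 15 min of lag, post-rename name
}

LAG_ALERT_METRIC = "log_ingestion_lag_seconds"  # alert still queries the old name
LAG_ALERT_THRESHOLD = 120.0                     # fire when lag exceeds 2 minutes


def evaluate_lag_alert(store: dict[str, float]) -> bool:
    """Return True if the lag alert should fire.

    Mirrors common alert-rule semantics: a query over a missing series
    yields no samples, and "no samples above threshold" is simply
    false, so the rule goes quiet instead of erroring.
    """
    value = store.get(LAG_ALERT_METRIC)
    if value is None:
        return False  # no data: the alert silently never fires
    return value > LAG_ALERT_THRESHOLD


if __name__ == "__main__":
    # Lag is far past the threshold, yet the alert stays silent because
    # its backing metric no longer exists under the queried name.
    print("lag alert firing:", evaluate_lag_alert(metrics))  # -> False
```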

Timeline (UTC)

  • ~12:00 - Ingestion latency begins accumulating for customers in AWS us-west-2

  • Morning (same day) - A PR is merged that renames the metric backing our primary lag alerts

  • ~15:50 - Secondary synthetic alerts fire once the log availability delay exceeds the 15-minute detection threshold. Team is paged.

  • ~15:50+ - Additional processing nodes are deployed to drain the backlog (see the capacity sketch after this timeline)

  • 18:15 - All data fully backfilled, ingestion latency returns to normal
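
The timeline's shape follows from simple queueing arithmetic: while processing throughput runs below incoming volume, lag grows roughly linearly, and once extra capacity is added the backlog drains at the rate of the new surplus. The numbers below are invented purely to illustrate the dynamics; the write-up publishes no throughput figures, only the timestamps.

```python
# Back-of-the-envelope model of how the lag grew, then drained.
# All throughput numbers are hypothetical; only the timestamps come
# from the timeline above.

INCOMING = 100_000    # records/second arriving (hypothetical)
PROCESSING = 93_500   # records/second of pre-incident capacity (hypothetical)
SCALED = 110_000      # records/second after adding nodes (hypothetical)

GROWTH_WINDOW_S = 3 * 3600 + 50 * 60  # 12:00 -> ~15:50 UTC

# Phase 1: capacity shortfall. The backlog grows by the difference,
# and end-to-end lag in seconds is roughly backlog / incoming rate.
backlog = (INCOMING - PROCESSING) * GROWTH_WINDOW_S
print(f"lag at detection: ~{backlog / INCOMING / 60:.0f} min")  # ~15 min

# Phase 2: after scaling out, the backlog drains at the surplus rate.
drain_s = backlog / (SCALED - INCOMING)
print(f"time to drain: ~{drain_s / 3600:.1f} h")  # ~2.5 h, vs. 15:50 -> 18:15
```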

Root Cause

Processing capacity in AWS us-west-2 was insufficient to keep up with incoming log volume, causing a slowly growing ingestion delay. This is a condition our primary lag alerts are designed to catch early. However, a metric rename merged that same morning broke those alerts without anyone noticing, leaving only our secondary synthetic checks (which have a 15-minute detection threshold) as a safety net.
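
One mitigation this failure mode suggests is meta-monitoring: periodically verifying that the metric behind each alert rule still reports data, so a rename surfaces as a loud "metric missing" page within minutes instead of a silent no-op. Below is a minimal sketch assuming a Prometheus-compatible query API and hypothetical metric names; the write-up does not describe Dash0's internal monitoring stack.

```python
# Sketch of a "dead man's switch" check for alert rules: confirm the
# metric each rule references still has recent samples. The endpoint,
# metric names, and 10-minute window are all hypothetical; assumes a
# Prometheus-compatible HTTP query API.
import sys

import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical endpoint

# Metrics our alert rules depend on (hypothetical names).
ALERT_BACKING_METRICS = [
    "log_ingestion_lag_seconds",
    "log_ingestion_throughput_records",
]


def metric_has_recent_data(metric: str) -> bool:
    """Return True if the metric produced any sample in the last 10 minutes."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f"count(last_over_time({metric}[10m]))"},
        timeout=10,
    )
    resp.raise_for_status()
    return len(resp.json()["data"]["result"]) > 0


if __name__ == "__main__":
    missing = [m for m in ALERT_BACKING_METRICS if not metric_has_recent_data(m)]
    if missing:
        # In production this would page rather than print: a rename or a
        # dropped scrape target shows up here within the check interval.
        print("ALERT: no recent data for:", ", ".join(missing))
        sys.exit(1)
    print("all alert-backing metrics reporting")
```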

Impact

  • Log data in AWS us-west-2 was delayed by up to 15 minutes

  • No data was lost; all logs were eventually processed and available