The issue has been resolved, though we are still trying to locate the root cause. There were no deploys today (Saturday), so it is not due to any change on our end, and the activity doesn't look unusual.
For whatever reason, our API workers went from ~10% utilization (normal) to 100% in a short span of time. We are investigating with AWS.
Update: after investigating with AWS for a few hours, neither they nor we can explain the jump in memory usage; their metrics and ours show no change in load, in the underlying systems, or in anything else. They have indicated that OOM kills can happen even when nothing in the AWS metrics points to them.
We are still investigating.
Posted Mar 11, 2023 - 21:23 UTC
The service is back up. We are still monitoring, but everything is operational.
Posted Mar 11, 2023 - 20:42 UTC
We are still investigating, but we have put mitigating changes in place.