The issue has been resolved, though we are still trying to locate the root cause. There were no deploys today (Saturday), so it is not due to any change on our end, and the activity doesn't look unusual.
For whatever reason, our API workers went from ~10% utilization (normal) to 100% in a short span of time. We are investigating with AWS.
Update: after investigating with AWS for a few hours, neither they nor we can explain the jump in memory usage; their metrics and ours show no change in load, in the underlying systems, or in anything else. They have indicated that OOM kills can happen even when nothing in the AWS metrics points to them.
We are still investigating.
Posted Mar 11, 2023 - 21:23 UTC
The service is back up. We are still monitoring, but everything is operational.
Posted Mar 11, 2023 - 20:42 UTC
We are still investigating, but we have put mitigating changes in place.