We experienced an issue that caused some messages on the platform to be significantly delayed before delivery. Only a small fraction of our customers, and an even smaller fraction of messages, were affected. There is NO ACTION REQUIRED on your end; the issue is already resolved.
Before I go into further details, I want to apologize to those who were affected by this. We know you rely on us for reliable webhook delivery, and we don't take this responsibility lightly. We take extensive measures to prevent issues like this from ever happening, but unfortunately this one was caused by a rare race condition, which is why neither our manual reviews and testing nor our automated testing caught it. Additionally, since rate-limited (throttled) messages are delayed by definition, they are excluded from our usual delivery latency metrics and alerting, so we weren't alerted either.
What is rate-limiting (throttling)? While Svix can handle however many messages you send us, your customers' endpoints may not be able to. To avoid overloading your customers' endpoints, they can define a maximum delivery rate. If more messages are sent per second than that rate allows, Svix will buffer the messages and send them at the rate they set. You can read more about it in the rate-limiting (throttling) docs.
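To make the throttling behavior concrete, here is a minimal, purely illustrative sketch of buffering messages and delivering them at a fixed per-second rate. It is not our actual implementation; the endpoint URL, the deliver function, and the rate value are made up for the example.

```python
# A minimal sketch of per-endpoint throttling, for illustration only.
# deliver(), the endpoint URL, and the rate are hypothetical, not Svix code.
import time
from collections import deque


def deliver(endpoint: str, message: str) -> None:
    # Placeholder for the actual HTTP delivery to the customer's endpoint.
    print(f"{time.strftime('%H:%M:%S')} delivered {message!r} to {endpoint}")


def drain_at_rate(endpoint: str, messages: list[str], per_second: int) -> None:
    """Send at most `per_second` messages each second; buffer the rest."""
    buffered = deque(messages)
    while buffered:
        window_start = time.monotonic()
        # Send up to the configured rate within this one-second window.
        for _ in range(min(per_second, len(buffered))):
            deliver(endpoint, buffered.popleft())
        # Anything left stays buffered until the next window.
        if buffered:
            time.sleep(max(0.0, 1.0 - (time.monotonic() - window_start)))


if __name__ == "__main__":
    # 5 messages against a limit of 2/second: the last ones are delayed, not dropped.
    drain_at_rate("https://example.com/webhook", [f"msg-{i}" for i in range(5)], per_second=2)
```

The key point is that throttled messages are intentionally delayed rather than dropped, which is also why they are excluded from our normal delivery latency alerting.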
We recently made a few improvements to the rate-limiting mechanism, and the issue was related to those changes. Because of that, it only affected endpoints that had rate-limiting enabled (either by you or your customers). Beyond requiring rate-limiting to be set and enabled, the bug was only triggered when the rate limit was actually hit (e.g. with a limit of 100 and only 50 messages sent, there was no throttling, so the bug couldn't trigger), when there was sustained backpressure on the rate limit (traffic stayed above the limit for a while), and when AWS SQS was suffering from increased latency (which happens on occasion). Under those conditions, the race condition meant that some throttled messages were not immediately scheduled for an additional trigger after the throttling ended, which is what caused the delayed deliveries.
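For those curious what this class of bug looks like in general, below is a purely illustrative sketch of a "lost follow-up trigger" race. It is not our actual code and every name in it is made up; it only shows how a non-atomic check-then-act on shared scheduling state, with extra queue latency widening the timing window, can leave a buffered message with nothing scheduled to deliver it.

```python
# Illustrative only: the general shape of a lost "schedule a follow-up trigger"
# step. All names are hypothetical; this is not Svix's actual code.
buffered: list[str] = []      # messages waiting out the rate limit
trigger_scheduled = False     # has a "drain the buffer later" trigger been queued?


def enqueue_throttled(msg: str, observed_flag: bool) -> None:
    """Producer side: buffer the message; schedule a drain unless one seems pending."""
    global trigger_scheduled
    buffered.append(msg)
    if not observed_flag:         # decision based on a possibly stale read
        trigger_scheduled = True
        print(f"{msg}: follow-up trigger scheduled")
    else:
        print(f"{msg}: follow-up assumed pending, not scheduling one")


def drain() -> None:
    """Trigger side: deliver everything buffered, then clear the flag."""
    global buffered, trigger_scheduled
    print(f"delivered: {buffered}")
    buffered = []
    trigger_scheduled = False


if __name__ == "__main__":
    enqueue_throttled("msg-1", observed_flag=trigger_scheduled)  # schedules a trigger

    # Bad interleaving: a second producer reads the flag just before the drain
    # trigger fires and clears it. Increased queue latency widens this window.
    stale = trigger_scheduled   # producer's read happens "too early" (True)
    drain()                     # trigger fires: delivers msg-1, clears the flag
    enqueue_throttled("msg-2", observed_flag=stale)

    # msg-2 is now buffered with no trigger scheduled, so nothing picks it up
    # promptly: a significantly delayed delivery.
    print(f"stuck: {buffered}, trigger_scheduled={trigger_scheduled}")
```

In the sketch, the stuck message eventually needs something else (later traffic or a background sweep) to pick it up, which mirrors why the affected messages were delayed rather than lost.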
The issue was first introduced when the new rate-limiting mechanism was rolled out to production. Following a testing period in staging, we decided it was ready for production use. We then enabled it for 0.1% of the traffic on the 20th of November, ramping up all the way to 10% of the traffic on the 16th of December. We deployed a fix for the core issue on the 16th of December; that fix was prompted by a benign symptom caused by the same core issue. At that stage, however, it was not clear that rate-limited messages could be delayed to such an extent, so we didn't proactively search for affected messages. This is why alerting affected customers was delayed.
We are very proud of our reliability track record, and we know we fell short of it in this scenario. As mentioned above, we have many measures in place to help us prevent issues like these from ever happening, though in this case they were insufficient due to the rare scenario required to trigger this race condition.
All of this doesn't change the fact that your service was adversely affected by this, and I'm extremely sorry about that. Please let me know if you have any questions or if there's anything else I can help you with.
–
Tom
Posted Dec 16, 2024 - 05:00 UTC