Dashboard and application portal were temporarily inaccessible for some users
Incident Report for Svix
Postmortem

Chain of events

  1. At 11:51pm UTC a new deployment changed the certificate to include the wrong CNAMEs.
  2. At 11:52pm UTC the team discovered the issue.
  3. At 11:57pm UTC the team found the cause and prepared a fix.
  4. At 12:07am UTC the issue was resolved (TLS certificate propagated).

What went wrong

  1. A change for an upcoming big feature inadvertently changed the certificates used by the servers serving our static content (dashboard and app portal).
  2. Since AWS may cache old certificates for a short period of time, both our automated and manual tests were served the good (old) certificates and passed. This happened in both staging and production.
  3. We have ongoing monitoring of the API endpoints in both staging and production, but we monitor the dashboard and frontend only in production.
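
As an illustration of the kind of check that could cover the dashboard and app portal in staging as well, here is a minimal sketch in Python. It is not Svix's actual monitoring; the hostnames and function names are hypothetical. It fetches the certificate a host is currently serving and reports which expected hostnames are not covered by its subjectAltName entries.

```python
from __future__ import annotations

import socket
import ssl


def san_dns_names(cert: dict) -> list[str]:
    """Return the DNS names listed in the subjectAltName of a getpeercert() dict."""
    return [value.lower() for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def name_matches(pattern: str, hostname: str) -> bool:
    """Match a hostname against one SAN entry; a leading '*.' covers exactly one label."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == hostname


def missing_hostnames(cert: dict, expected: list[str]) -> list[str]:
    """Return the expected hostnames that no SAN entry in the certificate covers."""
    names = san_dns_names(cert)
    return [h for h in expected if not any(name_matches(n, h) for n in names)]


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Fetch the certificate currently served by host:port.

    The chain is still validated against the system trust store, but the
    built-in hostname check is turned off so that a certificate with the
    wrong names is returned for inspection instead of aborting the handshake.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    for host in ["dashboard.example.com", "app.example.com"]:
        missing = missing_hostnames(fetch_cert(host), [host])
        print(host, "OK" if not missing else f"certificate does not cover {missing}")
```

Because old certificates may keep being served for a short time after a deployment, a single passing run right after a deploy is not conclusive. Repeating the check for several minutes, in staging as well as production, is one way to account for that.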

Aftermath

We have yet to conclude our full analysis, but we will draw our conclusions and put additional safety mechanisms in place to ensure that this and similar incidents never happen again.

Posted Aug 05, 2021 - 08:56 UTC

Resolved
The dashboard and application portal were temporarily inaccessible for ~15 minutes for some users.
A configuration issue caused the wrong TLS certificates to be served.
Posted Aug 05, 2021 - 00:07 UTC