On 9 February, between 09:39 and 09:49 UTC and again between 09:59 and 10:01 UTC, the TwentyThree platform rejected a large portion of incoming traffic and responded to accepted traffic with latency of up to 30 seconds.
The issue was caused by a surge of traffic generated by a misconfigured third-party tool. During the incident we received, within 10 minutes, the same amount of traffic we would normally receive over the span of a few days.
During the incident:
- VMP management was unavailable or responded with very high latency
- Some TwentyThree-hosted webinar Hubs and some embedded video players were inaccessible, depending on the state of the CDN cache
- Ongoing webinars kept streaming, but new attendees were unable to join webinars for the duration of the incident
Timeline:
- 09:39 UTC: we noticed an unusually high amount of traffic
- 09:40 UTC: we provisioned more resources to handle incoming traffic
- 09:41 UTC: we identified the cause and started working on an application-layer remediation while trying to limit the incoming traffic
- 09:47 UTC: we started recovering
- 09:49 UTC: we fully recovered
- 09:59 UTC: we saw another uptick in incoming traffic, causing the platform to stop responding
- 10:00 UTC: we started deploying application-layer remediation
- 10:01 UTC: with the application-layer remediation fully deployed, we fully recovered
To avoid the same issue in the future:
- We are putting stricter rate limiting in place (a sketch of the approach follows this list)
- We are continuing our effort to introduce asynchronous behaviour in the parts of our system that can put heavy pressure on the TwentyThree platform when they receive a lot of traffic (see the second sketch below)
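For illustration, the sketch below shows a token-bucket rate limiter, one common way to implement the kind of stricter rate limiting described above. The specific limits (10 requests per second, bursts of up to 50) are hypothetical examples, not our production configuration.

```python
import time
import threading

class TokenBucket:
    """Token-bucket rate limiter: allows `rate` requests per second,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be
        rejected (e.g. with HTTP 429 Too Many Requests)."""
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Hypothetical per-client limit: 10 requests/second, bursts up to 50.
limiter = TokenBucket(rate=10, capacity=50)
```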
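The second sketch illustrates the asynchronous pattern referred to above: instead of performing expensive work synchronously in the request path, the handler enqueues it and returns immediately, so a traffic spike queues up rather than overwhelming the platform. The queue size, status codes, and `process` step are hypothetical placeholders, not details of the TwentyThree system.

```python
import queue
import threading

# Hypothetical bounded work queue: request handlers enqueue expensive
# operations; a background worker drains the queue at its own pace.
work_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def process(payload: dict) -> None:
    """Placeholder for the expensive operation performed asynchronously."""
    ...

def handle_request(payload: dict) -> int:
    """Request handler: accept the work without doing it synchronously."""
    try:
        work_queue.put_nowait(payload)
        return 202  # Accepted: queued for asynchronous processing
    except queue.Full:
        return 503  # Shed load when the queue is saturated

def worker() -> None:
    """Background worker: processes queued items at a controlled rate."""
    while True:
        payload = work_queue.get()
        process(payload)
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```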