On 9 February, between 09:39 and 09:49 UTC and again between 09:59 and 10:01 UTC, the TwentyThree platform rejected a large portion of incoming traffic and responded to accepted traffic with latency of up to 30 seconds.
The issue was caused by a surge of traffic generated by a misconfigured third-party tool. During the incident we received, within 10 minutes, the same amount of traffic we would normally receive over the span of a few days.
During the incident:
- VMP management was unavailable or responded with very high latency
- Some TwentyThree-hosted webinar Hubs and some embedded video players were inaccessible, depending on the state of the CDN cache
- Ongoing webinars kept streaming, but new attendees were unable to join webinars for the duration of the incident
Timeline:
- 09:39 UTC: we noticed an unusually high amount of traffic
- 09:40 UTC: we provisioned more resources to handle incoming traffic
- 09:41 UTC: we identified the cause and started working on an application-layer remediation while trying to limit the incoming traffic
- 09:47 UTC: we started recovering
- 09:49 UTC: we fully recovered
- 09:59 UTC: we saw another uptick in incoming traffic, causing the platform to stop responding
- 10:00 UTC: we started deploying application-layer remediation
- 10:01 UTC: with the application-layer remediation fully deployed, we fully recovered
To avoid the same issue in the future:
- We are putting stricter rate limiting in place (a sketch of the approach follows this list)
- We are continuing our effort to introduce asynchronous behaviour in the parts of our system that can put heavy pressure on the TwentyThree platform when they receive a lot of traffic (see the second sketch below)
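For illustration, the sketch below shows a token-bucket rate limiter, one common way to implement the kind of stricter rate limiting described above. The specific limits (10 requests per second, bursts of up to 50) are hypothetical examples, not our production configuration.

```python
import time
import threading

class TokenBucket:
    """Token-bucket rate limiter: allows `rate` requests per second,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be
        rejected (e.g. with HTTP 429 Too Many Requests)."""
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Hypothetical per-client limit: 10 requests/second, bursts up to 50.
limiter = TokenBucket(rate=10, capacity=50)
```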
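The second sketch illustrates the asynchronous pattern referred to above: instead of performing expensive work synchronously in the request path, the handler enqueues it and returns immediately, so a traffic spike queues up rather than overwhelming the platform. The queue size, status codes, and `process` step are hypothetical placeholders, not details of the TwentyThree system.

```python
import queue
import threading

# Hypothetical bounded work queue: request handlers enqueue expensive
# operations; a background worker drains the queue at its own pace.
work_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def process(payload: dict) -> None:
    """Placeholder for the expensive operation performed asynchronously."""
    ...

def handle_request(payload: dict) -> int:
    """Request handler: accept the work without doing it synchronously."""
    try:
        work_queue.put_nowait(payload)
        return 202  # Accepted: queued for asynchronous processing
    except queue.Full:
        return 503  # Shed load when the queue is saturated

def worker() -> None:
    """Background worker: processes queued items at a controlled rate."""
    while True:
        payload = work_queue.get()
        process(payload)
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```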