Slow responses from the TwentyThree platform
Incident Report for TwentyThree
Postmortem

On 9 February 2024, between 09:39 and 09:49 UTC and again between 09:59 and 10:01 UTC, the TwentyThree platform rejected a large portion of incoming traffic and responded to the traffic it did accept with latency of up to 30 seconds.
The issue was caused by a large volume of traffic generated by a misconfigured third-party tool. During the incident we received as much traffic in 10 minutes as we would normally receive over the span of a few days.

During the incident:

  • VMP management was unavailable or responded with very high latency
  • Some TwentyThree-hosted webinar Hubs and some embedded video players were not accessible, depending on the state of the CDN cache
  • Ongoing webinars kept streaming, but people could not join webinars for the duration of the incident

Timeline:

  • 09:39 UTC: we noticed an unusually high amount of traffic
  • 09:40 UTC: we provisioned more resources to handle incoming traffic
  • 09:41 UTC: we identified the cause and started working on an application-layer remediation while trying to limit the incoming traffic
  • 09:47 UTC: we started recovering
  • 09:49 UTC: we fully recovered
  • 09:59 UTC: we saw another uptick in incoming traffic, causing the platform to stop responding
  • 10:00 UTC: we started deploying the application-layer remediation
  • 10:01 UTC: with the application-layer remediation fully deployed, we fully recovered

To avoid the same issue in the future:

  • We are putting stricter rate limiting in place
  • We are continuing our effort to introduce asynchronous behaviour to parts of our system that can put a lot of pressure on the TwentyThree platform when they receive a lot of traffic
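
The report does not say how the stricter rate limiting will be implemented. Purely as an illustration, one common approach is a per-client token bucket: each client may send short bursts, but its sustained rate is capped, so a single misconfigured tool cannot consume the capacity shared by everyone else. The class names, the `client_id` key, and the rate and capacity values below are hypothetical and are not taken from the TwentyThree platform.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client so one noisy client cannot starve the others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()


if __name__ == "__main__":
    # Hypothetical limits: 5 requests per second sustained, bursts of 10.
    limiter = PerClientLimiter(rate=5.0, capacity=10)
    accepted = sum(limiter.allow("misconfigured-tool") for _ in range(1000))
    print(f"accepted {accepted} of 1000 back-to-back requests")
```

In a sketch like this, a request for which `allow` returns `False` would typically be rejected immediately (for example with HTTP status 429) rather than queued, which keeps latency low for the traffic that is accepted.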
Posted Feb 12, 2024 - 16:03 CET

Resolved
This incident has been resolved.
Posted Feb 09, 2024 - 11:30 CET
Update
The issue has been resolved. We will provide more information about the incident in the post-mortem section.
Posted Feb 09, 2024 - 11:30 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 09, 2024 - 11:08 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 09, 2024 - 11:02 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 09, 2024 - 10:52 CET
Investigating
We are investigating an ongoing issue causing the TwentyThree platform to respond with more delay than usual.

We will update this incident once we know more.
Posted Feb 09, 2024 - 10:45 CET
This incident affected: TwentyThree Platform, Analytics, Video Processing, and Webinars.