Authors: SRE, DE, Support team
There was a performance degradation that made the application unstable from May 05 13:39 (UTC) to May 05 14:37 (UTC).
Minor - The performance degradation on the application happened from May 05 13:39 (UTC) to May 05 14:37 (UTC) and was perceived mostly as latency and error messages.
The identified root cause of this incident was defined as an unexpected consequence of a recent deploy that caused one of the system’s structures to malfunction.
Detection and resolution:
The issue was first detected by our internal monitoring systems that triggered automated alerts and informed the team that our users may have been experiencing latency in the application.
The support team has also received messages from both internal and external customers informing that they were experiencing unexpected behaviors.
After the root cause was identified, the team performed a rollback of the deployed code which resolved the latency and restored the platform’s performance.
Action plan: Preventive action items
1. Internally restructure old asynchronous processings to mitigate the risk of having another incident . Due date: 09-01-2020
2. Compile all the information processed by our monitoring tools to properly assess and pinpoint the cause of the performance degradation. Due date: 05-22-2020
3. Implement an upgraded and more complete post-deploy checklist. Due date: 05-22-2020