Authors: SRE and Support Teams
Date: 2020-05-06
Status: Resolved
Summary:
There was a platform outage that prevented users from accessing the web application from May 06 18:10 (UTC) to May 06 18:30 (UTC).
Impact:
Major - All Pipefy users were unable to access both the web application from May 06 18:10 (UTC) to May 06 18:30 (UTC). The outage has affected the entire user base.
Root Cause:
The identified root cause of this incident was defined as a modification that was made on Amazon Security Group Cluster. Due to a fail in properly versioning the change, it was overwritten on sync.
Detection and resolution:
The issue was first detected by our internal monitoring systems that triggered automated alerts and informed the team that our users were unable to properly access the application.
Messages from both internal and external customers were also received while the team had already begun investigating.
Once the root cause was identified, the issue was solved by manually inserting (and versioning) the new rule in the Security Group, which restored access and ensured the application was working properly.
Action plan: Preventive action items
1. Implementation of additional preventive alert structures to identify possible inconsistencies before they affect the users. Due date: 05-22-2020