Service Degradation Post-Mortem
Authors: SRE, Security and Support Teams
Date: 2019-12-18
Status: Resolved
Summary: Web and mobile applications outage
Impact: Critical - Pipefy users were unable to access either the web or the mobile application from 12:43 p.m. UTC to 2:28 p.m. UTC (1 hour 45 minutes).
Root Cause:
The root cause of this incident was internal human error. During a routine security maintenance, an employee mistakenly deactivated some routing settings between our services.
With those settings deactivated, information stopped flowing between the services as expected, which ultimately caused the platform outage.
Even though the platform was not accessible to users, all internal services, such as automation rules, email templates and cards created via email, continued working normally.
The outage affected only external users' access to the platform (via the web and mobile apps); internally, the systems kept running.
Detection and resolution:
The issue was first detected by one of our internal monitoring systems, which triggered an outage alarm.
A DNS issue was initially suspected as a possible cause, but after thorough investigation and troubleshooting, the technical team found a communication failure between two networks that access to our applications depends on.
Once the root cause was isolated, the team recreated the routing settings that had been deactivated during the maintenance, and access was restored.
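The investigation above had to separate a suspected DNS issue from a failure on the network path. As an illustration only, the following minimal Python sketch shows one way a probe could make that distinction by testing name resolution and the TCP connection as two separate steps. It is not the tooling the team used; the function names, the default port and the injectable `resolve` and `connect` parameters are assumptions made for this example.

```python
import socket
from typing import Callable, Optional


def _tcp_connect(address: str, port: int, timeout: float) -> None:
    # Open and immediately close a TCP connection; raises OSError on failure.
    with socket.create_connection((address, port), timeout=timeout):
        pass


def classify_reachability(
    host: str,
    port: int = 443,  # assumed HTTPS port, not taken from the report
    timeout: float = 3.0,
    resolve: Optional[Callable[[str], str]] = None,
    connect: Optional[Callable[[str, int, float], None]] = None,
) -> str:
    """Return 'ok', 'dns_failure' or 'network_failure' for host:port."""
    resolve = resolve or socket.gethostbyname
    connect = connect or _tcp_connect

    # Step 1: name resolution. socket.gaierror is a subclass of OSError.
    try:
        address = resolve(host)
    except OSError:
        return "dns_failure"

    # Step 2: network path. Timeouts and refused connections are OSError too.
    try:
        connect(address, port, timeout)
    except OSError:
        return "network_failure"

    return "ok"
```

A probe of this kind that reports `network_failure` while resolution succeeds points toward routing or connectivity rather than DNS. The `resolve` and `connect` parameters exist only so the classification logic can be exercised without real network access.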
Action plan: Preventive action items
1. Review and improve the existing network layer alerts within the production environment on Amazon Web Services. Due date: 2020-01-30
2. Review the existing Disaster Recovery documentation and assess the possibility of implementing parallel preset environments. Due date: 2020-02-20
3. Enforce internal policies to notify the internal service owners before and after each change that may affect the production environment. Due date: 2020-01-20
4. Plan a long-term fault tolerance solution to be implemented gradually within the next year. Due date: 2020-05-02
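As an illustration of the kind of rule action item 1 could cover, the sketch below shows an alert that fires after a number of consecutive failed network probes and clears on the first success. It is a hypothetical example, not the existing alert configuration; the class name and the threshold of 3 are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureAlert:
    """Fires once `threshold` probes in a row have failed (hypothetical rule)."""

    threshold: int = 3  # assumed value, chosen only for illustration
    failures: int = 0
    firing: bool = False

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return whether the alert is firing."""
        if probe_ok:
            # Any successful probe resets the counter and clears the alert.
            self.failures = 0
            self.firing = False
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.firing = True
        return self.firing
```

Requiring several consecutive failures trades a slightly later alarm for fewer false positives from a single dropped probe. The right threshold depends on the probe interval and on how quickly an outage must be detected.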