Pipefy is currently unavailable

Incident Report for Pipefy

Postmortem

Service Degradation Post-Mortem

Authors: SRE, Security and Support Teams

Date: 2019-12-18

Status: Resolved

Summary: Web and mobile applications outage

Impact: Critical. Pipefy users were unable to access either the web or the mobile application from 12:43 p.m. UTC to 2:28 p.m. UTC (1 hour 45 minutes).

Root Cause:

The root cause of this incident was internal human error. During a security maintenance routine, an employee mistakenly deactivated some routing settings between our services.

With those settings deactivated, information stopped flowing between the services as expected, which ultimately caused the platform outage.

Even though the platform was not accessible to users, internal services such as automation rules, email templates, and cards created via email kept working normally.

The outage affected only external access to the platform (via the web and mobile apps); internally, the systems kept running.

Detection and resolution:

The issue was first detected by one of our internal monitoring systems that triggered an outage alarm.

At first, a DNS issue was suspected as a possible cause. After thorough investigation and troubleshooting, however, the technical team found a communication failure between two networks that affect access to our applications.

Once the root cause was isolated, the team recreated the routing settings that had been deactivated during the maintenance, and access was restored.
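
The distinction drawn during troubleshooting, a DNS problem versus a routing problem between networks, can be illustrated with a simple two-step probe. The following is a minimal sketch, not the tooling used during this incident: the function name classify_probe, its return labels, and the injectable resolve/connect parameters are all hypothetical. It assumes that a name which resolves but whose addresses refuse or time out on a TCP connection points toward the network path rather than DNS.

```python
import socket
from typing import Callable


def classify_probe(
    host: str,
    port: int = 443,
    timeout: float = 3.0,
    resolve: Callable = socket.getaddrinfo,      # injectable for testing
    connect: Callable = socket.create_connection,  # injectable for testing
) -> str:
    """Return 'dns_failure', 'network_failure', or 'ok' for one probe.

    Hypothetical helper: the labels are illustrative, not a standard.
    """
    # Step 1: name resolution. A failure here points at DNS.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: try a TCP connection to each resolved address. If the name
    # resolves but no address accepts a connection, the fault is more
    # likely in routing or filtering between networks than in DNS.
    for _family, _type, _proto, _canon, sockaddr in infos:
        try:
            conn = connect(sockaddr[:2], timeout=timeout)
        except OSError:
            continue
        conn.close()
        return "ok"
    return "network_failure"
```

Run from outside the production network, a probe like this returning "network_failure" while internal services stay healthy would match the pattern described above: names resolve, but external traffic cannot reach the applications.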

Action plan: Preventive action items

1. Review and improve the existing network layer alerts within the production environment on Amazon Web Services. Due date: 01/30/2020

2. Review the existing Disaster Recovery documentation and assess the possibility of implementing parallel preset environments. Due date: 02/20/2020

3. Enforce internal policies to notify the internal service owners before and after each change that may affect the production environment. Due date: 01/20/2020

4. Plan a long-term fault tolerance solution to be implemented gradually within the next year. Due date: 05/02/2020

Posted Dec 19, 2019 - 13:56 UTC

Resolved

The outage has been fixed, and access to Pipefy’s web and mobile applications has been restored.
As soon as the investigation is over, we’ll share further details about the causes, the fixes implemented, and the preventive actions to be taken.

Posted Dec 18, 2019 - 14:44 UTC

Monitoring

The outage has been fixed, and access to Pipefy’s web and mobile applications has been restored.
We are currently monitoring the system to ensure all features are working as expected. As soon as the monitoring process is over, we’ll share further details about the causes, the investigation, and the preventive actions to be taken.

Posted Dec 18, 2019 - 14:30 UTC

Identified

The technical team has identified the causes of the system unavailability and is currently working towards restoring full access to all users as soon as possible.
Any further details about the system status, investigation and preventive actions to avoid future incidents will be shared as soon as available.

Posted Dec 18, 2019 - 13:55 UTC

Update

We are currently investigating the system unavailability and working towards restoring full access to all users as soon as possible.
Any further details about the system status, investigation and preventive actions to avoid future incidents will be shared as soon as available.

Posted Dec 18, 2019 - 13:30 UTC

Investigating

We are currently investigating the issue.

Posted Dec 18, 2019 - 12:59 UTC

This incident affected: Application, API (GraphQL), and Mobile App.