Pipefy is unstable

Incident Report for Pipefy

Postmortem

Authors: SRE, DE, Support team

Date: 2020-05-05

Status: Resolved

Summary:

A performance degradation made the application unstable from May 05 13:39 (UTC) to May 05 14:37 (UTC).

Impact:

Minor - The application’s performance was degraded from May 05 13:39 (UTC) to May 05 14:37 (UTC). Users perceived this mostly as latency and error messages.

Root Cause:

The root cause of this incident was an unexpected side effect of a recent deploy, which caused one of the system’s structures to malfunction.

Detection and resolution:

The issue was first detected by our internal monitoring systems, which triggered automated alerts and informed the team that users might be experiencing latency in the application.

The support team also received messages from both internal and external customers reporting unexpected behavior.

After the root cause was identified, the team rolled back the deployed code, which resolved the latency and restored the platform’s performance.

Action plan: Preventive action items

1. Internally restructure old asynchronous processes to mitigate the risk of another incident. Due date: 09-01-2020

2. Compile all the information processed by our monitoring tools to properly assess and pinpoint the cause of the performance degradation. Due date: 05-22-2020

3. Implement an improved, more complete post-deploy checklist. Due date: 05-22-2020

Posted May 11, 2020 - 19:43 UTC

Resolved

The unavailability has been fixed and access to Pipefy’s web and mobile applications has been restored.
As soon as the investigation is over, we’ll share further details about the causes, the fixes applied, and the preventive actions to be implemented.

Posted May 05, 2020 - 14:37 UTC

Monitoring

The unavailability has been fixed and access to Pipefy’s web and mobile applications has been restored.
We are currently monitoring the system to ensure all features are working as expected. As soon as the monitoring process is over, we’ll share further details about the causes, investigation, and preventive actions to be implemented.

Posted May 05, 2020 - 14:11 UTC

Update

We are currently investigating the system unavailability and working to restore full access for all users as soon as possible.
Further details about the system status, the investigation, and preventive actions to avoid future incidents will be shared as soon as they are available.

Posted May 05, 2020 - 13:51 UTC

Update

We are continuing to investigate this issue.

Posted May 05, 2020 - 13:39 UTC

Investigating

We are currently investigating this issue.

Posted May 05, 2020 - 13:39 UTC

This incident affected: Application and Mobile App.