Service Degradation Post-Mortem
Authors: SRE, Security and Support Teams
Some users experienced a partial outage of Pipefy's email-related features from Apr 28 15:58 (UTC) to Apr 28 16:34 (UTC).
This instability was caused by a scheduled migration, started on Apr 18, with the third-party vendor that handles emails in our platform. The migration was planned to be executed in stages to avoid any issues that could impact the users of these services. The team also had a fallback plan to quickly roll back changes in case of issues.
Due to a misinterpretation of the communications, there were some misalignments in the actions that required resetting the account and setting up important DNS changes. Because of TTL ("Time to Live"), the mechanism that establishes how long DNS records stay in cache before being refreshed, some users were able to use the feature normally within minutes, while others whose cached records had a longer TTL experienced issues for about an hour.
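The effect of TTL described above can be illustrated with a minimal sketch. It is a simplified model of a caching resolver, not the actual configuration involved in this incident: the IP addresses, the TTL values, and the names `CachedRecord` and `resolve` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    value: str        # cached DNS answer (hypothetical IP address)
    cached_at: float  # time the resolver stored the answer, in seconds
    ttl: int          # time to live of the cached answer, in seconds


def resolve(cache: CachedRecord, authoritative_value: str, now: float) -> str:
    """Return the answer a caching resolver would give at time `now`."""
    if now - cache.cached_at < cache.ttl:
        # Cache entry has not expired: the old answer is still served.
        return cache.value
    # Cache entry expired: the resolver fetches the updated record.
    return authoritative_value


# Hypothetical values: the record changes right after both resolvers cached it.
OLD_IP, NEW_IP = "192.0.2.10", "198.51.100.20"
short_ttl = CachedRecord(OLD_IP, cached_at=0, ttl=300)   # 5 minutes
long_ttl = CachedRecord(OLD_IP, cached_at=0, ttl=3600)   # 1 hour

for minutes in (1, 10, 30, 61):
    now = minutes * 60
    print(minutes, resolve(short_ttl, NEW_IP, now), resolve(long_ttl, NEW_IP, now))
```

In this model, the resolver with the 5-minute TTL picks up the new record within minutes, while the resolver with the 1-hour TTL keeps serving the old record until its cache entry expires.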
Major - Several users experienced a partial outage of the email-related features from Apr 28 15:58 (UTC) to Apr 28 16:34 (UTC).
The root cause of this incident was identified as an operational error in the procedures agreed upon with the third-party email service.
Detection and resolution:
The issue was detected by our internal monitoring system, which triggered an alert and informed the team. We also received reports from internal and external clients.
After the root cause was identified, our team quickly connected with the vendor and applied the correct IP settings to the subaccounts in the system.
Action plan: Preventive action items
1. Review and improve internal email error monitoring systems to identify any abnormality immediately. Due date: 06/21