Root Cause
The delay occurred because we failed to process a specific update, which was unable to be sent to the queue that later goes to the external system. This happened due to a connection field being incorrectly updated through the API.
Resolution
As several updates were taking place during the incident, we had to implement a solution in the application to deal with this exception. Essentially, the payload size is now validated before being sent to the queue, which solved the problem during the incident. Other actions have been mapped for further analysis.
Action Plan
We will create a disaster recovery process and resume discussion on alarm systems.