Root Cause
The incident was caused by memory exhaustion on certain database nodes. Some database queries required more memory than was available, so the system automatically terminated critical processes. As a result, the database service crashed and could not restart on its own.
Resolution
Once the affected nodes were identified, the database services were restarted manually, which resolved the errors and restored normal operations.
Action Plan
We plan to set up monitoring to detect similar issues in the future. We will also analyze the database settings to determine whether any adjustments are needed, and conduct a review to understand what caused the memory shortage.
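
As an illustration of the kind of monitoring check described above, here is a minimal sketch. It assumes the database nodes run Linux and expose /proc/meminfo, which this report does not confirm. The function names and the 90% threshold are hypothetical choices for illustration, not the team's actual tooling or settings.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_fraction(meminfo: Dict[str, int]) -> float:
    """Fraction of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 1.0 - available / total


def should_alert(text: str, threshold: float = 0.90) -> bool:
    """True when memory use meets or exceeds the (assumed) alert threshold."""
    return memory_used_fraction(parse_meminfo(text)) >= threshold


if __name__ == "__main__":
    # Assumption: running on a Linux node where /proc/meminfo exists.
    with open("/proc/meminfo") as f:
        print("ALERT" if should_alert(f.read()) else "OK")
```

A check like this would typically run on a schedule on each node and feed an existing alerting system. The appropriate threshold depends on the outcome of the database settings analysis.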