CloudGuard - US DC - Increase in error rate

Incident Report for Check Point Services Status

Postmortem

Summary

On Monday, May 6, 2024, between 12:39 UTC and 13:55 UTC, all users of CloudGuard (US region) experienced degraded performance and login failures.

The event was triggered by an extreme load on our internal services caused by internal activity and external API calls.

Our internal alerting and client reports clearly pointed to a major issue.

The high load caused the CloudGuard database to stop functioning.

The incident was eventually mitigated by recovering the database.

The system then became stable again.

Incident Timeline

Monday, May 6, 2024, 12:39 UTC – An alert is triggered. A war room is created to diagnose the issue.

Monday, May 6, 2024, 12:54 UTC – It’s clear an incident has started. Database recovery begins.

Monday, May 6, 2024, 13:00 UTC – The status page is updated.

Monday, May 6, 2024, 13:45 UTC – The system shows signs of recovery.

Monday, May 6, 2024, 13:55 UTC – The system is up and running and is being monitored.

Monday, May 6, 2024, 14:45 UTC – It’s clear the system is fully operational again. Clients have reported no issues for a meaningful period. The incident is closed.

Root Cause Analysis

A rare combination of calls to the CloudGuard database produced an extreme load that pushed the database to its limits.

The database became degraded, which caused the major outage.

Next Steps

We sincerely apologize for the recent outage of our system. We take our availability very seriously and we understand that this outage has caused you inconvenience. We appreciate your patience and understanding during this time. Further steps we are planning to take:

Identify and improve connection management for the CloudGuard database
Add limits on the sources that access the CloudGuard database
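
As an illustration of what limiting individual sources could look like, the sketch below applies a token-bucket limit per calling source. It is a minimal example under stated assumptions, not a description of the CloudGuard implementation: the class names `TokenBucket` and `PerSourceLimiter`, the source labels, and the rate and capacity values are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Token bucket for one source (hypothetical helper, for illustration only)."""
    rate: float      # tokens refilled per second
    capacity: float  # maximum burst size
    tokens: float    # tokens currently available
    updated: float   # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        # Spend one token per call; reject the call when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerSourceLimiter:
    """Keeps an independent bucket per source so one caller cannot starve the rest."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, source: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(source)
        if bucket is None:
            # A new source starts with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[source] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    # Assumed values: 5 calls per second per source, bursts of up to 10.
    limiter = PerSourceLimiter(rate=5.0, capacity=10.0)
    for source in ("internal-job", "external-api"):
        allowed = sum(limiter.allow(source) for _ in range(20))
        print(f"{source}: {allowed} of 20 calls allowed")
```

A caller would check `limiter.allow(source)` before issuing a database call and reject or delay the call when it returns `False`. Because each source has its own bucket, a burst from one source does not consume the budget of another.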

Posted May 09, 2024 - 15:09 UTC

Resolved

This incident has been resolved.

Posted May 06, 2024 - 14:47 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 06, 2024 - 14:14 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted May 06, 2024 - 13:02 UTC

This incident affected: CloudGuard CNAPP (CloudGuard - US Region).