Check Point Services Status
All Systems Operational
Check Point ThreatCloud Operational
90 days ago
100.0 % uptime
Today
Check Point SandBlast Threat Emulation Cloud Operational
90 days ago
100.0 % uptime
Today
Check Point SandBlast Threat Emulation Cloud - San Jose, California, USA Operational
90 days ago
100.0 % uptime
Today
Check Point SandBlast Threat Emulation Cloud - Dublin, Ireland Operational
90 days ago
100.0 % uptime
Today
Check Point SandBlast Threat Emulation Cloud - Ashburn, North Virginia, USA Operational
90 days ago
100.0 % uptime
Today
Check Point SandBlast Threat Emulation Cloud - Sydney, Australia Operational
90 days ago
100.0 % uptime
Today
Check Point SandBlast Threat Emulation Cloud - Ningxia, China Operational
90 days ago
100.0 % uptime
Today
Check Point SandBlast Threat Emulation Cloud - Frankfurt, Germany Operational
90 days ago
100.0 % uptime
Today
Check Point Infinity Portal Operational
90 days ago
99.86 % uptime
Today
Check Point Infinity Portal EU Region Operational
90 days ago
99.81 % uptime
Today
Check Point Infinity Portal US Region Operational
90 days ago
99.99 % uptime
Today
Check Point Infinity Portal AU Region Operational
90 days ago
100.0 % uptime
Today
Infinity Portal DataTube Operational
90 days ago
99.6 % uptime
Today
Check Point Infinity Next Cloud Operational
90 days ago
100.0 % uptime
Today
Check Point Infinity Next Cloud EU Region Operational
90 days ago
100.0 % uptime
Today
Check Point Harmony Connect (CloudGuard Connect) Operational
90 days ago
99.99 % uptime
Today
Check Point Quantum Smart-1 Cloud Operational
90 days ago
100.0 % uptime
Today
Check Point Quantum Smart-1 Cloud - EU Region Operational
90 days ago
100.0 % uptime
Today
Check Point Quantum Smart-1 Cloud – US Region Operational
90 days ago
100.0 % uptime
Today
Check Point Quantum Smart-1 Cloud – APAC Region Operational
90 days ago
100.0 % uptime
Today
Check Point Harmony Endpoint Cloud Management (SandBlast Agent) Operational
90 days ago
99.73 % uptime
Today
Check Point Harmony Mobile (SandBlast Mobile) Operational
90 days ago
100.0 % uptime
Today
Check Point Harmony Email & Office 2.0 (Cloud MTA) Operational
90 days ago
99.89 % uptime
Today
Check Point Harmony Email & Office (CloudGuard SaaS) Operational
90 days ago
99.93 % uptime
Today
Check Point Quantum Spark Portal (SMB Appliances) Operational
90 days ago
100.0 % uptime
Today
smbmgmtservice.checkpoint.com Operational
90 days ago
100.0 % uptime
Today
smp1.checkpoint.com Operational
90 days ago
100.0 % uptime
Today
Check Point Quantum Spark Reach My Device Service (SMB Appliances) Operational
90 days ago
100.0 % uptime
Today
Check Point Quantum Spark Zero Touch Service (SMB Appliances) Operational
90 days ago
100.0 % uptime
Today
Check Point Quantum Update Service Operational
90 days ago
99.94 % uptime
Today
Check Point SandBlast Cloud for Office 365 Operational
90 days ago
99.97 % uptime
Today
Check Point Capsule Cloud Operational
90 days ago
100.0 % uptime
Today
Capsule Workspace Push Notifications Service Operational
90 days ago
100.0 % uptime
Today
Check Point Web Sites Operational
90 days ago
99.96 % uptime
Today
Check Point UserCenter / PMAP Operational
90 days ago
99.9 % uptime
Today
Check Point Beyond Operational
90 days ago
100.0 % uptime
Today
Check Point Support Center Operational
90 days ago
100.0 % uptime
Today
Past Incidents
Apr 16, 2021

No incidents reported today.

Apr 15, 2021

No incidents reported.

Apr 14, 2021

No incidents reported.

Apr 13, 2021
Resolved - This incident has been resolved.
Apr 13, 12:17 UTC
Identified - The issue has been identified and a fix is being implemented.
Apr 13, 10:08 UTC
Apr 12, 2021
Resolved - This incident has been resolved.
Apr 12, 13:32 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Apr 12, 13:15 UTC
Identified - The issue has been identified and a fix is being implemented.
Apr 12, 12:26 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Apr 12, 11:59 UTC
Investigating - Office 365 customers in Europe may experience issues with email delivery or with security enforcement.
Our teams are investigating the issue.
Apr 12, 11:17 UTC
Resolved - This incident has been resolved.
Apr 12, 13:31 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Apr 12, 13:15 UTC
Identified - The issue has been identified and a fix is being implemented.
Apr 12, 12:25 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Apr 12, 11:59 UTC
Identified - The issue has been identified and a fix is being implemented.
Apr 12, 11:58 UTC
Update - We are continuing to investigate this issue.
Apr 12, 11:21 UTC
Investigating - Customers in Europe may have issues with the Identity Protection login sequence or with security enforcement for Identity Protection.
Our teams are investigating the issue.
Apr 12, 11:19 UTC
Resolved - Executive Summary
----------------------

On April 12, the Infinity Portal was down for 2 hours due to a networking issue in the EU cluster. There was no access to the UI or the EU API; only direct calls to the US API remained available. The Australia-based Infinity Portal was working as usual.


Incident Timeline
---------------------

April 12, ~10:00

The on-call DevOps engineer receives a report from the ThreatCloud team about problems in the EU cluster: many pods in the cluster are stuck in the Pending state.

April 12, 10:00-12:30

In an attempt to fix the reported issue, the DevOps team tried to add the missing resources manually by adding nodes to the relevant node group.

April 12, 12:31

Infinity Portal is down.

OpsGenie alerts indicating an issue with the environment are triggered for both the DevOps and CloudInfra teams. A message from an SE is also received in the SOS channel.

April 12, 12:42
The Status Page is updated and an SOS announcement is posted in the Teams channel to notify TAC and support personnel.


April 12, 14:50
Issue is resolved completely.

Status Page is updated and the announcement is updated in the Teams channel.
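
The length of the outage can be read off the two timeline entries above. The following is a minimal sketch of that arithmetic; the function names are illustrative, and the times are taken as written in the timeline, which does not state a timezone.

```python
from datetime import datetime

# Timeline entries copied from the incident timeline above.
# The report does not state a timezone, so both are treated as the same local time.
PORTAL_DOWN = "April 12, 12:31"
RESOLVED = "April 12, 14:50"


def parse_entry(text: str, year: int = 2021) -> datetime:
    """Parse a 'Month day, HH:MM' timeline entry into a datetime."""
    return datetime.strptime(f"{year} {text}", "%Y %B %d, %H:%M")


def outage_minutes(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two timeline entries."""
    delta = parse_entry(end) - parse_entry(start)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    minutes = outage_minutes(PORTAL_DOWN, RESOLVED)
    print(f"Outage lasted {minutes // 60} h {minutes % 60} min")
```

By this reading the portal was unavailable for 2 hours and 19 minutes, which the summary rounds to 2 hours.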


Detailed Description
-----------------------

The CloudInfra DevOps team is responsible for maintaining the various Kubernetes clusters used by CloudInfra and its hosted applications. Currently there are 5 clusters:

kube1 – Shared cluster for staging environment.

eu-west-1-kube – Shared cluster for production environment.

eu-west-1-kube2 – CloudInfra & ThreatCloud cluster for production environment.

us-east-1-kube1 – ThreatCloud cluster for production environment.

us-east-1-kube2 – CloudInfra & shared cluster for production environment.


Note that CloudInfra services are not deployed to dedicated clusters, so a cluster issue unrelated to CloudInfra can still cause an outage of the Infinity Portal and its services.

On the morning of April 12th, the ThreatCloud team reported many pods in the Pending state on eu-west-1-kube2, possibly caused by insufficient resources in their dedicated node groups.

Normally this is resolved automatically by the autoscaling feature, which adds or removes nodes according to the resources requested by the pods. (It was later found that autoscaling had an issue and therefore did not work properly.)

In an attempt to fix the insufficient-resources issue, the DevOps team tried to manually add more nodes to the node groups with pending pods:

nodes-threatcloud-prod
nodes-threatcloud-mta-prod

The manual attempt failed because the kOps cluster and the matching AWS Auto Scaling group were out of sync. This attempt added around 100 nodes, since no change in the pods was seen.

Adding too many nodes caused a failure in the Weave networking component, which had a limit of 200 nodes.

The error message indicating too many nodes in Weave Net led to an understanding of the issue, which was addressed by increasing the limit from 200 to 300. Restarting all weave-net pods on all nodes in the relevant cluster with the updated environment variable solved the issue.

Later that day (around 4am), the DevOps team solved the Auto Scaling group out-of-sync issue, which brought the number of nodes back to its original value.
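
The failure described above came from a scale-up that pushed the cluster past the networking layer's node limit. A guard of the following kind could, in principle, reject such a change before it is applied. This is only a sketch under stated assumptions: the function names and the node counts in the usage example are hypothetical, and only the limits of 200 and 300 come from this report.

```python
def remaining_headroom(current_nodes: int, node_limit: int) -> int:
    """Return how many more nodes fit under the networking node limit."""
    return max(node_limit - current_nodes, 0)


def check_scale_up(current_nodes: int, nodes_to_add: int, node_limit: int = 200) -> int:
    """Return the new cluster size, or raise if it would exceed the limit.

    node_limit defaults to 200, the limit in effect before the incident.
    """
    total = current_nodes + nodes_to_add
    if total > node_limit:
        raise ValueError(
            f"scale-up to {total} nodes exceeds the limit of {node_limit}; "
            f"only {remaining_headroom(current_nodes, node_limit)} more nodes fit"
        )
    return total


if __name__ == "__main__":
    # Hypothetical numbers: a 150-node cluster asked to grow by 100 nodes.
    try:
        check_scale_up(current_nodes=150, nodes_to_add=100)
    except ValueError as err:
        print(f"rejected: {err}")
    # With the raised limit of 300 the same request passes.
    print(check_scale_up(current_nodes=150, nodes_to_add=100, node_limit=300))
```

A check like this does not replace fixing autoscaling; it only makes the limit visible at the moment a manual change is attempted.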

Summary
---------------

Infinity Portal was down for 2 hours on April 12, 2021.

This and past incidents show that hosting CloudInfra nodes in the same cluster as other applications is bad practice and should be changed immediately.

Action items:
---------------

- Fix the auto scaling group – Done

- Reduce number of nodes to a reasonable number – Done

- Move CloudInfra to a dedicated cluster in EU production – In progress. ETA: April EOM
Apr 12, 11:53 UTC
Update - We are continuing to investigate this issue.
Apr 12, 10:09 UTC
Update - We are continuing to investigate this issue.
Apr 12, 09:51 UTC
Update - We are continuing to investigate this issue.
Apr 12, 09:50 UTC
Investigating - We are currently investigating this issue.
Apr 12, 09:49 UTC
Apr 11, 2021
Completed - The scheduled maintenance has been completed.
Apr 11, 15:20 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Apr 11, 13:21 UTC
Scheduled - Smart-1 Cloud is currently under maintenance for the EU region.
Creating a new Smart-1 Cloud environment might not be available.
Expected impact on existing environments is low.
Apr 11, 13:19 UTC
Completed - The scheduled maintenance has been completed.
Apr 11, 12:25 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Apr 11, 11:55 UTC
Scheduled - Web Management components update. Some disruptions may occur during that time.
Apr 11, 11:51 UTC
Completed - The scheduled maintenance has been completed.
Apr 11, 11:47 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Apr 11, 09:00 UTC
Scheduled - We will be undergoing scheduled maintenance during this time.
Apr 8, 12:24 UTC
Apr 10, 2021

No incidents reported.

Apr 9, 2021

No incidents reported.

Apr 8, 2021
Resolved - This incident has been resolved.
Apr 8, 21:30 UTC
Identified - Logs view for existing environments is working.
We are still investigating the issue with onboarding new customers.
Apr 8, 16:54 UTC
Investigating - Due to the ongoing DataTube incident, new Smart-1 Cloud customers are unable to onboard, and in existing environments the logs view returns a "query failed" error.
Apr 8, 11:37 UTC
Resolved - This incident has been resolved.
Apr 8, 17:19 UTC
Investigating - Some customers may have issues viewing logs or overview reports.
The issue is related to the Infinity DataTube issue and is being investigated.

There is no impact on email flow or security functionality.
Apr 8, 15:28 UTC
Resolved - This incident has been resolved.
Apr 8, 15:30 UTC
Update - We are continuing to investigate this issue.
Apr 8, 11:08 UTC
Investigating - We are currently investigating this issue.
Apr 8, 11:07 UTC
Apr 7, 2021
Resolved - This incident has been resolved.
Apr 7, 16:36 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Apr 7, 14:45 UTC
Identified - The issue has been identified and a fix is being implemented.
Apr 7, 14:20 UTC
Investigating - We are currently investigating the issue.
When creating a new account for Harmony Endpoint, management shows error messages after deployment.
Apr 7, 10:28 UTC
Apr 6, 2021

No incidents reported.

Apr 5, 2021

No incidents reported.

Apr 4, 2021

No incidents reported.

Apr 3, 2021

No incidents reported.

Apr 2, 2021

No incidents reported.