Infinity Portal is down
Incident Report for Check Point Services Status
Resolved
Executive Summary
----------------------

On April 12, the Infinity Portal was down for 2 hours due to a networking issue in the EU cluster. The UI and the EU API were inaccessible; only direct calls to the US API remained available. The Australia-based Infinity Portal was working as usual.


Incident Timeline
---------------------

April 12, ~10:00

The assigned on-call DevOps engineer receives a complaint from the ThreatCloud team about problems in the EU cluster: many pods in the cluster are stuck in the Pending state.

April 12, 10:00-12:30

In an attempt to fix the reported issue, the DevOps team tried to add the needed capacity manually by adding nodes to the relevant node group.

April 12, 12:31

Infinity Portal is down.

OpsGenie alerts indicating an issue with the environment are triggered for both the DevOps and CloudInfra teams. A message from an SE is also received in the SOS channel.

April 12, 12:42
The Status Page is updated and an SOS announcement is made in the Teams channel so that TAC and support personnel are notified.


April 12, 14:50
The issue is fully resolved.

The Status Page is updated and the announcement in the Teams channel is updated accordingly.


Detailed Description
-----------------------

The CloudInfra DevOps team is responsible for maintaining the various Kubernetes clusters used by CloudInfra and its hosted applications. There are currently five clusters:

kube1 – Shared cluster for staging environment.

eu-west-1-kube – Shared cluster for production environment.

eu-west-1-kube2 – CloudInfra & ThreatCloud cluster for production environment.

us-east-1-kube1 – ThreatCloud cluster for production environment.

us-east-1-kube2 – CloudInfra & shared cluster for production environment.


It is worth noting that CloudInfra services are not deployed to dedicated clusters, meaning that an issue in a shared cluster, even one unrelated to CloudInfra, can cause an outage of the Infinity Portal and its services.

On the morning of April 12, the ThreatCloud team reported that many pods were stuck in the Pending state on eu-west-1-kube2, likely caused by insufficient capacity in their dedicated node group.
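
For illustration, the stuck pods could have been listed with something like the following (the kubectl context name is a placeholder, not recorded in this report):

    # List every pod stuck in Pending across the cluster
    kubectl --context eu-west-1-kube2 get pods --all-namespaces \
        --field-selector=status.phase=Pending

    # Describe one of them; the Events section shows why it cannot be scheduled
    kubectl --context eu-west-1-kube2 describe pod <pending-pod> -n <namespace>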

Normally this would be resolved automatically by the autoscaling feature, which adds or removes nodes according to the resources requested by the pods. (It was later found that the autoscaling had an issue and therefore did not work properly.)
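
As a sketch of how such a failure can be spotted, assuming the standard Kubernetes cluster-autoscaler runs as a deployment in kube-system (the report does not say which autoscaler was in use):

    # Confirm the autoscaler is running at all
    kubectl -n kube-system get deployment cluster-autoscaler

    # Look for scale-up decisions and errors in its recent logs
    kubectl -n kube-system logs deployment/cluster-autoscaler --tail=200 | grep -iE 'scale|error'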

In an attempt to fix the capacity issue, the DevOps team tried to manually add nodes to the node groups that had pending pods (a sketch of the kind of commands involved follows the list):

nodes-threatcloud-prod
nodes-threatcloud-mta-prod
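
A rough sketch of the kind of manual scaling that was attempted, assuming the standard kOps workflow (the state-store bucket is a hypothetical placeholder; the exact commands were not recorded):

    export KOPS_STATE_STORE=s3://<kops-state-bucket>   # hypothetical bucket

    # Inspect the current size of the instance group behind the node group
    kops get instancegroup nodes-threatcloud-prod --name eu-west-1-kube2 -o yaml

    # Raise minSize/maxSize in the editor, then push the change to AWS
    kops edit instancegroup nodes-threatcloud-prod --name eu-west-1-kube2
    kops update cluster eu-west-1-kube2 --yes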

The manual attempt failed because the kOps cluster was out of sync with the matching AWS Auto Scaling group. Since no change was seen in the pods, the attempt was repeated, and around 100 nodes ended up being added.
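
Such drift can be checked by comparing what kOps believes against what AWS actually reports (the Auto Scaling group name is a placeholder):

    # What kOps thinks the instance group size should be
    kops get instancegroup nodes-threatcloud-prod --name eu-west-1-kube2 -o yaml | grep -E 'minSize|maxSize'

    # What the matching AWS Auto Scaling group actually has
    aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names <asg-name> \
        --query 'AutoScalingGroups[0].[MinSize,MaxSize,DesiredCapacity]'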

Adding too many nodes caused a failure in the Weave Net networking component, which had a configured limit of 200 nodes.

The error message indicating that there were too many nodes in Weave Net led to an understanding of the issue, which was solved by increasing the limit from 200 to 300. Restarting all weave-net pods on all nodes in the relevant cluster with the updated environment variable resolved the outage.
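
The report does not name the environment variable; Weave Net's peer-connection limit is CONN_LIMIT (its default in recent versions is 200, matching the limit hit here), so a plausible sketch of the fix is:

    # Raise the limit on the weave container (CONN_LIMIT is an assumption,
    # not confirmed by the report)
    kubectl -n kube-system set env daemonset/weave-net -c weave CONN_LIMIT=300

    # Restart every weave-net pod so all nodes pick up the new value,
    # as described above
    kubectl -n kube-system delete pods -l name=weave-net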

Later that day (around 4 AM), the out-of-sync issue with the Auto Scaling group was resolved by the DevOps team, which brought the number of nodes back down to its original count.

Summary
---------------

Infinity Portal was down for 2 hours on April 12, 2021.

This and past incidents show that hosting CloudInfra in the same cluster as other applications is bad practice and should be changed immediately.

Action items:
---------------

- Fix the auto scaling group – Done

- Reduce number of nodes to a reasonable number – Done

- Move CloudInfra to a dedicated cluster in EU production – In progress. ETA: April EOM
Posted Apr 12, 2021 - 11:53 UTC
Update
We are continuing to investigate this issue.
Posted Apr 12, 2021 - 10:09 UTC
Update
We are continuing to investigate this issue.
Posted Apr 12, 2021 - 09:51 UTC
Update
We are continuing to investigate this issue.
Posted Apr 12, 2021 - 09:50 UTC
Investigating
We are currently investigating this issue.
Posted Apr 12, 2021 - 09:49 UTC
This incident affected: Infinity Portal (Infinity Portal EU Region, Infinity Portal US Region).