Infinity Portal Data Services - some customers are experiencing ingestion timeouts
Incident Report for Check Point Services Status
Postmortem

Summary

On Sunday, May 15th, between 07:30 UTC and 13:00 UTC, we noticed a low rate of server errors in the logs of our Infinity Portal Data Services API and began investigating.

We found that the errors originated in Azure Cosmos DB and were caused by the request rate exceeding the provisioned capacity. The team increased the capacity of Cosmos DB to handle more requests.

That did not resolve the issue, so we increased the capacity again. We also contacted Microsoft and opened a support ticket to help us resolve the issue.

We then found an issue related to incorrect API ingestion requests from the CloudGuard Intelligence Application, which caused endless redundant "find tenant" calls to the Azure Cosmos DB repository.

The Intelligence Application was rolled back to its previous version, the Infinity Portal Data Services API was restarted, and the ingestion backlog was processed.

Incident Timeline

07:30 UTC May 15 – Infinity Portal Data Services errors started

08:30 UTC May 15 – Investigation started; errors traced to Azure Cosmos DB

09:30 UTC May 15 – Cosmos DB scaled out to resolve requests exceeding quota

11:45 UTC May 15 – Cosmos DB scaled out again and a Microsoft support ticket was opened

15:30 UTC May 15 – Intelligence Application rolled back to previous version (root cause)

19:30 UTC May 15 – Recreated Infinity Portal Data Services repository collection on Azure Cosmos DB

21:30 UTC May 15 – Scaled out pipeline to handle ingestion backlog

23:30 UTC May 15 – Pipeline failed to handle backlog – restarted Infinity Portal Data Services pipeline

08:30 UTC May 16 – No more errors, ingestion backlog decreasing

Root Cause Analysis

The Infinity Portal Data Services pipeline verifies tenant existence when an application ingests data. It uses a caching mechanism to reduce the number of calls to the Azure Cosmos DB repository.
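
As a minimal sketch of such a mechanism (the class name `TenantVerifier` and its structure are illustrative assumptions, not the actual service code), a positive-only cache serves known tenants from memory but sends every lookup that finds no tenant straight to the database:

```python
# Hypothetical sketch of a tenant-existence check with a positive-only
# cache; this is an illustration of the failure mode, not the actual
# Infinity Portal Data Services implementation.

class TenantVerifier:
    def __init__(self, db_lookup):
        self._db_lookup = db_lookup  # call into the Cosmos DB repository
        self._cache = {}             # app credentials -> tenant record
        self.db_calls = 0            # counter, for illustration only

    def verify(self, app_credentials):
        # Cache hit: no database call needed.
        if app_credentials in self._cache:
            return self._cache[app_credentials]

        self.db_calls += 1
        tenant = self._db_lookup(app_credentials)

        # Only *found* tenants are cached. A lookup with credentials that
        # match no tenant is never cached, so every new file triggers
        # another "find tenant" call against the database.
        if tenant is not None:
            self._cache[app_credentials] = tenant
        return tenant
```

With valid credentials, repeated lookups cost a single database call; with credentials that match no tenant, every call reaches the database. Caching the not-found result as well (negative caching, typically with a short TTL) is a common way to close this gap.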

The Intelligence App deployed a new version to production, migrating to a new ingestion method. This version was deployed with incorrect app credentials. Because no tenant was found for these incorrect credentials, the caching mechanism was effectively bypassed, and the Infinity Portal Data Services pipeline searched for the tenant on every new file uploaded to the system.

In addition, we found that scaling out Azure Cosmos DB did not help because of an incorrect partition key on the tenants repository.
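
In a hash-partitioned store such as Cosmos DB, throughput is divided across physical partitions by partition-key value, so a key that concentrates traffic on one value keeps all requests on a single partition no matter how far the container is scaled out. A pure-Python sketch of this effect (not the Cosmos DB SDK, and the actual key used in the tenants repository is not specified here):

```python
import hashlib
from collections import Counter

def partition_for(key: str, n_partitions: int) -> int:
    """Map a partition-key value to a partition by hashing (simplified)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

def load_distribution(keys, n_partitions):
    """Count how many requests land on each partition."""
    return Counter(partition_for(k, n_partitions) for k in keys)

# Every request uses the same partition-key value (a hot key):
# all 1000 requests land on one partition, so scaling to 10
# partitions adds no effective throughput for this workload.
hot = load_distribution(["tenants"] * 1000, n_partitions=10)

# A high-cardinality key (e.g. a per-tenant id) spreads the same
# load, letting the added partitions actually absorb traffic.
spread = load_distribution([f"tenant-{i}" for i in range(1000)], n_partitions=10)
```

Under the hot key, scaling from 10 to 100 partitions changes nothing, because every request still hashes to the same partition; only a higher-cardinality partition key lets scale-out help.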

Rolling back the Intelligence App resolved the initial issue.

As a result, an ingestion backlog built up while ingestion was stopped, and it took some time for the system to clear the bottleneck.

Posted Jun 16, 2022 - 07:16 UTC

Resolved
This incident has been resolved.
Posted May 15, 2022 - 16:26 UTC
Identified
The issue was identified and the impact on customers was significantly reduced.
The issue is still being handled by Microsoft.
Posted May 15, 2022 - 15:24 UTC
Update
Microsoft is investigating the issue.
Posted May 15, 2022 - 14:07 UTC
Investigating
Some customers are experiencing ingestion timeouts. We’re aware of the issue and are working to resolve it.

Please know our teams are doing their best to identify the root cause and implement a solution, and we will update you with the latest information as soon as possible.
Posted May 15, 2022 - 13:36 UTC
This incident affected: Infinity Portal (Infinity Portal Data Services (DataTube)).