Summary
On Sunday, May 15th, between 07:30 UTC and 13:00 UTC, we noticed a low rate of server errors in the logs of our Infinity Portal Data Services API and began investigating.
We found that the errors originated in Azure Cosmos DB and were caused by the request rate exceeding the provisioned capacity. The team increased the capacity of Cosmos DB to handle more requests.
That did not resolve the issue, so we increased the capacity again. We also contacted Microsoft and opened a support ticket to help us resolve the issue.
One issue was found related to incorrect API ingestion requests by the CloudGuard Intelligence Application, which caused an endless stream of redundant calls to the Azure Cosmos DB repository (Find Tenant).
The Intelligence Application was rolled back to its previous version, the Infinity Portal Data Services API was restarted, and the ingestion backlog was processed.
Incident Timeline
07:30 UTC May 15 – Infinity Portal Data Services errors started
08:30 UTC May 15 – Investigation started; errors traced to Azure Cosmos DB
09:30 UTC May 15 – Cosmos DB scaled out to address the request rate exceeding the quota
11:45 UTC May 15 – Cosmos DB scaled out again; Microsoft support ticket opened
15:30 UTC May 15 – Intelligence Application rolled back to previous version (root cause)
19:30 UTC May 15 – Recreated Infinity Portal Data Services repository collection on Azure Cosmos DB
21:30 UTC May 15 – Scaled out pipeline to handle ingestion backlog
23:30 UTC May 15 – Pipeline failed to handle the backlog; Infinity Portal Data Services pipeline restarted
08:30 UTC May 16 – No more errors; ingestion backlog decreasing
Root Cause Analysis
The Infinity Portal Data Services pipeline verifies tenant existence when an application ingests data. It uses a caching mechanism to reduce the number of calls to the Azure Cosmos DB repository.
The Intelligence App deployed a new version to production, migrating to a new ingestion method, but the new version was deployed with incorrect app credentials. Because no tenant matched these credentials, the lookup result was never cached, so the Infinity Portal Data Services pipeline queried Azure Cosmos DB for a tenant on every new file uploaded to the system, bypassing the caching mechanism entirely.
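The failure mode above can be sketched as a cache that only stores successful lookups. A minimal, self-contained illustration (all names hypothetical; `find_tenant_in_cosmos` stands in for the real Azure Cosmos DB query) shows why requests with bad credentials hammer the database, and how caching the "not found" result as well would have contained the load:

```python
cosmos_calls = 0
tenants_db = {"t-123": {"id": "t-123"}}  # stand-in for the Cosmos tenants repository
cache = {}

def find_tenant_in_cosmos(tenant_id):
    """Simulated Cosmos DB lookup; counts every call."""
    global cosmos_calls
    cosmos_calls += 1
    return tenants_db.get(tenant_id)

def get_tenant_positive_caching_only(tenant_id):
    # Buggy behaviour: only successful lookups are cached, so every
    # request for an unknown tenant goes straight to Cosmos DB.
    if tenant_id in cache:
        return cache[tenant_id]
    tenant = find_tenant_in_cosmos(tenant_id)
    if tenant is not None:
        cache[tenant_id] = tenant
    return tenant

MISSING = object()  # sentinel marking a cached "tenant not found"

def get_tenant_with_negative_caching(tenant_id):
    # Fixed behaviour: "not found" is cached too, so repeated requests
    # with bad credentials hit Cosmos DB only once.
    if tenant_id in cache:
        hit = cache[tenant_id]
        return None if hit is MISSING else hit
    tenant = find_tenant_in_cosmos(tenant_id)
    cache[tenant_id] = tenant if tenant is not None else MISSING
    return tenant

# 1000 files ingested under bad credentials -> 1000 Cosmos calls (buggy path)
for _ in range(1000):
    get_tenant_positive_caching_only("bad-credentials-tenant")
print(cosmos_calls)  # 1000

cosmos_calls = 0
cache.clear()
for _ in range(1000):
    get_tenant_with_negative_caching("bad-credentials-tenant")
print(cosmos_calls)  # 1
```

Negative caching (with a short TTL, so a newly created tenant is picked up promptly) is one common mitigation for this class of repeated-miss load; the rollback removed the bad credentials and therefore the misses themselves.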
In addition, we found that scaling out Azure Cosmos DB did not help because of an incorrect partition key on the tenants repository.
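Why an incorrect partition key defeats scale-out: Cosmos DB distributes both data and provisioned throughput across physical partitions by hashing the partition key, so a low-cardinality key funnels every request to one hot partition regardless of how much total capacity is added. A hypothetical sketch (the key choices here are illustrative, not the actual schema):

```python
import hashlib

def physical_partition(partition_key_value: str, partition_count: int) -> int:
    """Toy model of hash-based partition placement."""
    digest = hashlib.md5(partition_key_value.encode()).hexdigest()
    return int(digest, 16) % partition_count

tenant_ids = [f"tenant-{i}" for i in range(10_000)]

# Bad choice: a constant-valued key (e.g. a document "type" field that is
# always "tenant") -> every lookup lands on the same physical partition.
bad_spread = {physical_partition("tenant", 10) for _ in tenant_ids}

# Better choice: partition by tenant id -> load spreads across partitions,
# so added throughput actually serves the hot workload.
good_spread = {physical_partition(t, 10) for t in tenant_ids}

print(len(bad_spread), len(good_spread))  # 1 10
```

With the constant key, scaling from a few partitions to many leaves the hot partition's share of throughput unchanged, which matches the observation that two scale-outs did not reduce the error rate.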
Rolling back the Intelligence App resolved the initial issue.
As a result, an ingestion backlog built up while ingestion was stopped, and it took some time for the system to clear the bottleneck.