Incident information:
Incident ID | CLOIS-3191 |
---|---|
Start Date | Sunday, September 5, 2021, at 00:06 UTC |
End Date | Sunday, September 5, 2021, at 00:55 UTC |
Consequences | Infinity Portal EU/US partial outage: specific flows involving specific services were not functioning |
Summary
Between 00:06 and 00:55 UTC on September 5, 2021, users in both the EU and US data residencies could not log in to the Infinity Portal.
The event was triggered by an alert (#28252) at 00:15 UTC that was acknowledged immediately by the on-call engineer. This alert fires when logins to the portal fail, and it reports the number of unreachable data residencies.
Once the event was acknowledged, a software engineer, a DevOps engineer, and an on-duty manager (Group Manager) worked together to resolve the problem.
The issue was resolved once we restarted the 100 active instances of the gateway service in the EU data center and performed a rollout restart of the users service in both the EU and US data centers.
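The alert logic described above can be illustrated with a minimal sketch. It assumes a synthetic login probe runs once per data residency. The `ProbeResult` type and the `evaluate_login_alert` function are hypothetical names used only for illustration and do not describe the actual monitoring implementation.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    # Hypothetical record of one synthetic login attempt.
    residency: str   # e.g. "EU" or "US"
    login_ok: bool   # True if the login attempt succeeded


def evaluate_login_alert(results):
    """Return (should_alert, unreachable_residencies) for a set of probe results."""
    unreachable = sorted(r.residency for r in results if not r.login_ok)
    return (len(unreachable) > 0, unreachable)


# During this incident, logins failed in both data residencies.
should_alert, unreachable = evaluate_login_alert(
    [ProbeResult("EU", False), ProbeResult("US", False)]
)
print(should_alert, len(unreachable), unreachable)
```

With both residencies failing, the sketch raises the alert and reports two unreachable data residencies.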
Root Cause Analysis
The root cause was a networking failure between our gateway and several specific services (users, geo-discovery, etc.): multiple requests received 500 or 504 status codes for no apparent reason. Most responses took less than 50 ms, so a 504 (Gateway Timeout) is unexpected, since that status normally means the gateway waited too long for an upstream response. After we restarted the EU gateway instances, no further 500 or 504 responses were observed.
As a result of this incident, a total of 208 requests failed between 03:00 and 04:00 IST.
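The observation above, 504 responses returned well under any plausible timeout, is the kind of pattern that can be pulled out of gateway access logs. The following is a minimal sketch, assuming log records are available as `(service, status, latency_ms)` tuples. The record shape, the `summarize_errors` function, and the sample records are all hypothetical and are not data from this incident. Only the 50 ms threshold comes from the analysis above.

```python
from collections import Counter

FAST_MS = 50  # from the analysis above: most responses took less than 50 ms


def summarize_errors(records):
    """Count 500/504 responses and collect 504s that returned suspiciously fast.

    records: iterable of (service, status, latency_ms) tuples (assumed shape).
    """
    by_status = Counter()
    fast_504 = []
    for service, status, latency_ms in records:
        if status in (500, 504):
            by_status[status] += 1
            # A genuine gateway timeout should take at least as long as the
            # configured timeout, so a fast 504 points at something else.
            if status == 504 and latency_ms < FAST_MS:
                fast_504.append((service, latency_ms))
    return by_status, fast_504


# Made-up sample records, for illustration only.
sample = [
    ("users", 504, 12),
    ("users", 200, 8),
    ("geo-discovery", 500, 30),
    ("users", 504, 30000),
]
by_status, fast_504 = summarize_errors(sample)
print(dict(by_status), fast_504)
```

Grouping the fast 504s by service would show whether the failures cluster on particular upstreams such as users or geo-discovery.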
Actions Taken
00:18 UTC – Alert created
00:22 UTC – Alert acknowledged
00:50 UTC – EU Gateway instances restarted
00:55 UTC – Issue resolved
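The intervals implied by this timeline can be worked out with a short sketch. The timestamps are taken from the Incident information table and the Actions Taken list. The `minutes_between` helper is a hypothetical name, and it assumes all times fall on the same UTC day.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


incident_start = "00:06"   # Start Date in the incident table
alert_created = "00:18"
alert_acked = "00:22"
gateway_restart = "00:50"
resolved = "00:55"

print("start to alert:", minutes_between(incident_start, alert_created), "min")
print("alert to acknowledgement:", minutes_between(alert_created, alert_acked), "min")
print("acknowledgement to restart:", minutes_between(alert_acked, gateway_restart), "min")
print("total duration:", minutes_between(incident_start, resolved), "min")
```

On these timestamps, detection took 12 minutes, acknowledgement 4 minutes, the restart came 28 minutes after acknowledgement, and the incident lasted 49 minutes in total.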
Next Steps
Action | Completion Date |
---|---|
Contact AWS support for help investigating the networking issue between the gateway and our services (DevOps team) | Immediately |