August 31, 4:02 PM PDT We have resolved the issue affecting network connectivity within a single Availability Zone (usw2-az2) in the US-WEST-2 Region. Beginning at 10:58 AM PDT, we experienced network connectivity issues for Network Load Balancer, NAT Gateway and PrivateLink endpoints within the US-WEST-2 Region.
At 2:45 PM, some Network Load Balancers, NAT Gateways and PrivateLink endpoints began to see recovery and by 3:35 PM, all affected Network Load Balancers, NAT Gateways and PrivateLink endpoints had fully recovered. The issue has been resolved and the service is operating normally.
— from AWS Service Health Dashboard
Regions and Zones
To better explain the incident, let's briefly look at what AWS Regions and Availability Zones are.
AWS has the concept of a Region, which is a physical location around the world where we cluster data centers. We call each group of logical data centers an Availability Zone. Each AWS Region consists of multiple, isolated, and physically separate AZs within a geographic area.
Each AZ has independent power, cooling, and physical security and is connected via redundant, ultra-low-latency networks. AWS customers focused on high availability can design their applications to run in multiple AZs to achieve even greater fault-tolerance.
— excerpt from AWS Regions and Availability Zones
Picture - Multiple Availability Zones in an AWS region
Regarding the incident: requests to any service deployed in the impacted Availability Zone (data center) were very slow or failing (timing out) at an elevated rate; in this incident, our telemetry measured a failure rate of around 10-15%.
Cloud applications usually consist of multiple services. Here are a few examples of what the impact could look like. If a fictional application:
- ran its services only in the affected AZ, it would be failing at a 10-15% rate.
- had a key service in the affected AZ, the other services would be fine, but anything depending on that key service would be failing.
- had any networking component (load balancer, ingress) deployed in the affected AZ, those requests would be failing.
- was following the AWS Well-Architected Framework guidance to “Distribute workload data and resources across multiple Availability Zones” and was running multiple instances of each service distributed across multiple AZs, every request reaching a replica of any service in the affected AZ would fail (time out, in this specific case) at an elevated rate.
In OneLogin, we are focused on high availability, so the system is designed to leverage multiple AZs (usually at least 3 in a given region).
Note: Since we follow the recommended high-resiliency, high-availability practice, we allow cross-zone calls. But if a whole AZ is failing, a request can fail at each layer of the topology. So, in a complex solution with many inter-service calls, the total failure rate for each end user request may be higher than the raw Availability Zone failure rate (depending on the inter-service call topology).
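To make the note above concrete, here is a minimal sketch of how a single-AZ failure rate compounds across inter-service calls when cross-zone routing is allowed. The numbers and the independence assumption are illustrative, not our actual topology:

```python
# Sketch: how a single failing AZ compounds across inter-service calls.
# Assumes each call independently lands in a random AZ (illustrative only).

def end_to_end_failure_rate(az_fail_rate: float, n_azs: int, hops: int) -> float:
    """Probability a request fails when each of `hops` calls has a
    1/n_azs chance of landing in the failing AZ."""
    per_call_failure = az_fail_rate / n_azs
    # The request succeeds only if every hop avoids a failure:
    return 1 - (1 - per_call_failure) ** hops

# One AZ out of 3 failing 15% of the time, 5 inter-service hops:
rate = end_to_end_failure_rate(0.15, 3, 5)
print(f"{rate:.1%}")  # → 22.6%, well above the 5% raw per-call rate
```

This is why a deep call topology can see a higher end-user failure rate than the raw AZ failure rate suggests.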
Picture - OneLogin deployment in US with Availability Zone affected
Login Clusters: To The Rescue!
In my previous post, I described how we designed Login Clusters as a way to achieve ultra high scale and reliability. This incident gave a clear opportunity to test that architecture!
So how did we do?
Thanks to our mature telemetry and synthetic monitoring, our engineers and operations teams were alerted right at the onset of the incident at 11:00 AM.
The team immediately jumped on the incident-bridge call and started assessing the situation.
We quickly narrowed the issue to the us-west-2 region, but within the region, the situation looked more complex. Each of our 75 services was reporting elevated failures, but nothing was completely failing.
In similar cases, the cause is often one of the key backend components used by most other services, like a database or a message queue. But it can also be a networking component or some other widespread cause.
The clock is ticking, and in these situations it is crucial to have your solution designed so that you can immediately take mitigating actions. Only then can you continue looking for the root cause and resolution since every second counts.
We assessed our options:
- Login cluster failover - We run two login clusters (with a 50/50 traffic split) in each region. This is a less aggressive and faster mitigation action, but as both clusters were reporting failures, it would not help.
- Region failover - For our end user facing traffic, either region can take traffic from the other. Our us-east-2 region did not report any failures, so we decided to initiate our script that prescales the target region to add capacity for the source region's traffic and then takes the failing region out of service, moving all traffic to the healthy (us-east-2) region.
At 11:16 AM, we initiated traffic failover from the affected region and by 11:20 AM (20 min into the incident), the majority of the end user functionality was fully recovered.
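As a rough sketch (the function and state shape are hypothetical, not our actual tooling), the failover script's logic boils down to two steps: prescale the healthy region, then shift all routing weight to it:

```python
# Hypothetical sketch of a region failover: prescale the target region
# first, then take the failing region out of the routing rotation.

def fail_over(regions: dict, source: str, target: str) -> dict:
    """regions maps region name -> {'weight': int, 'capacity': int}."""
    # Step 1: prescale the target so it can absorb the source's traffic.
    regions[target]["capacity"] += regions[source]["capacity"]
    # Step 2: take the failing region out of service.
    regions[source]["weight"] = 0
    regions[target]["weight"] = 100
    return regions

state = {
    "us-west-2": {"weight": 50, "capacity": 40},  # failing region
    "us-east-2": {"weight": 50, "capacity": 40},  # healthy region
}
state = fail_over(state, source="us-west-2", target="us-east-2")
print(state["us-east-2"])  # {'capacity': 80, 'weight': 100}
```

The ordering matters: shifting traffic before the target is prescaled would overload the healthy region.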
Picture - Traffic rate in us-west-2 (blue) and us-east-2 (yellow) regions
The picture shows the traffic split between regions and the impact of the region failover. The leftover traffic to us-west-2 is the admin-related flows.
As described in one of our previous blog posts, we separate ingress traffic to OneLogin into two groups: End user and Admin.
End user login is any requests to OneLogin on behalf of an end user attempting to access OneLogin, authenticate to OneLogin, or authenticate to or access an app via OneLogin, whether via the OneLogin UX, supported protocol, or API
The End user login is our most critical functionality and, therefore, gets special focus and much higher reliability requirements.
Now that we had seen recovery in our telemetry and gotten confirmation from our customer support team that the situation had stabilized, we needed to:
- Look for any residual failures
- Find, understand and fix the root cause
A quick look at our telemetry revealed what we expected: the admin traffic still had elevated failure rates. The impacted us-west-2 region was what we call the “primary region”, where we host our primary read/write database and also the admin cluster.
Reconstructing the admin cluster in the secondary region is a more complex process, and given that admin traffic is much less urgent than end user traffic, we focused on finding and resolving the root cause.
Our teams continued to work on the full recovery. When AWS announced at 12:13 PM that the impact was limited to a single Availability Zone (AZ), we focused fully on diverting traffic and removing our services from the affected AZ in our primary us-west-2 region.
Our Kubernetes cluster design and its node group distributions allowed us to relatively easily drain services and ingress in the affected Availability Zone (AZ). The affected AZ was also removed from the relevant load balancers. This resulted in most requests to the platform succeeding.
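As a sketch of what that drain looks like in practice (node names are hypothetical; `kubectl drain` and its flags are standard, and nodes are assumed to carry the usual `topology.kubernetes.io/zone` label), one can generate a drain command for every node in the affected zone:

```python
# Sketch: build the kubectl commands to drain every node in one AZ.
# Node names are made up; the kubectl flags are the standard ones for
# evicting pods while leaving daemonsets alone.

def drain_commands(nodes: dict, bad_zone: str) -> list:
    """nodes maps node name -> its topology.kubernetes.io/zone label."""
    return [
        f"kubectl drain {name} --ignore-daemonsets --delete-emptydir-data"
        for name, zone in nodes.items()
        if zone == bad_zone
    ]

nodes = {
    "node-a": "usw2-az1",
    "node-b": "usw2-az2",  # the affected AZ
    "node-c": "usw2-az3",
}
for cmd in drain_commands(nodes, "usw2-az2"):
    print(cmd)
```

Draining cordons each node and evicts its pods, so the scheduler reschedules workloads onto nodes in the healthy zones.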
We had more problems with some of the edge flows, but an active discussion over the open incident bridge with our AWS partners helped us discover a single misconfigured VPC that had all of its subnets routed through a single NAT gateway in the problematic AZ. Once this configuration was fixed, all timeouts were resolved, and at 1:16 PM the service was fully recovered.
There was a recurrence of failures (admin traffic only) during the window of 2:06 PM - 3:20 PM, because some infrastructure components that the team had drained earlier automatically scaled back up in the still-affected AZ.
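The lesson behind that recurrence can be sketched in a few lines (the data shape is hypothetical): a drain that only removes running instances is incomplete; the affected zone must also be removed from the autoscaling configuration, or new instances will launch right back into it:

```python
# Sketch of the playbook fix: draining an AZ must also remove it from the
# autoscaling configuration, otherwise scale-up events relaunch instances
# in the bad zone. The group structure here is illustrative.

def drain_zone(group: dict, bad_zone: str) -> dict:
    """group holds the zones an autoscaling group may launch into."""
    group["zones"] = [z for z in group["zones"] if z != bad_zone]
    return group

group = {"name": "admin-cluster", "zones": ["usw2-az1", "usw2-az2", "usw2-az3"]}
group = drain_zone(group, "usw2-az2")
print(group["zones"])  # ['usw2-az1', 'usw2-az3']
```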
We take many follow-up actions after each incident to make sure we prevent the same or similar issues from happening again, mitigate impact faster, and learn from the mistakes we have made.
The cornerstone of our aftermath actions is the postmortem review. The goal of these reviews is to capture detailed information about the incident, to identify corrective actions that will prevent similar incidents in the future, and to track the specific work items that implement those actions. These no-blame postmortem reviews serve as both learning and teaching tools.
The full Postmortem writeup is quite long, with many details. Following are some major findings.
What Went Well
- Monitoring detected the incident quickly and our team immediately escalated and opened an incident bridge
- AWS (via their AWS Enterprise Support) spun up a dedicated bridge to give us rapid feedback and guidance on the incident
- No customer reports on end user flows
What Went Wrong
- Readonly API and Vigilance AI services were affected for longer than expected
- The reaction time to fix the end user flow (20 min) was great, but still not where we want it to be (our goal is seconds, not minutes)
- Misconfigured NAT gateway
- Later recurrence of errors for admin traffic
Actions to take
- Finish support of API read only tier in secondary regions (login cluster)
- Finish Vigilance AI implementation in secondary regions (login cluster)
- Fix single point of failure (NAT Gateway) for legacy VPC in USW2
- Move traffic away from affected datacenters faster (automation)
- Add better drain command to our incident playbook that prevents scaling instances back in drained AZ
All of the above items have been assigned tickets with target dates and will be tracked as part of our Corrective and Preventive Actions (CAPA) process.
This was a fairly widespread incident that put every service, application, and platform under the same conditions, so we were naturally curious how similar platform services in our sector that used the same region handled it.
We looked at similar services and their published impact. Following is a comparison of OneLogin and one of our direct, close competitors, based on analysis of publicly available data.
Windows and failure rates overview
First, we looked at the windows of impact and average failure rate in these windows.
| Traffic | Provider | Avg failure rate | Window of impact | Elapsed time |
|---|---|---|---|---|
| End user | Competitor | ~15% | 10:59 AM - 3:35 PM | 276 min |
| End user | OneLogin | ~3% | 11:00 AM - 11:20 AM | 20 min |
| Admin | Competitor | ~15% | 10:59 AM - 3:35 PM | 276 min |
| Admin | OneLogin | ~3% | 11:00 AM - 1:16 PM, 2:06 PM - 3:20 PM | 210 min |
The above table shows a clear difference. Not only is our failure rate 5 times lower, but the window of impact, especially for the most important end user traffic, is an order of magnitude shorter.
Failure rate comparison (whole incident window)
Let's make this a better side-by-side comparison by averaging the failure rate over the same window: the whole incident.
| Traffic | Provider | Avg failure rate | Incident window | Comparison |
|---|---|---|---|---|
| End user | Competitor | ~15% | 10:59 AM - 3:35 PM | |
| End user | OneLogin | ~0.22% | 10:59 AM - 3:35 PM | 70x better |
| Admin | Competitor | ~15% | 10:59 AM - 3:35 PM | |
| Admin | OneLogin | ~2.3% | 10:59 AM - 3:35 PM | 7x better |
Not bad, we were about 70 times better in this direct comparison on the most important flows!
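The whole-window averages follow from simple back-of-the-envelope arithmetic: impact confined to part of the incident gets diluted when averaged over the full 276-minute window:

```python
# Averaging a failure rate over the whole incident window: a ~3% failure
# rate sustained for only part of the window dilutes to a much lower
# whole-window average.

def whole_window_rate(rate: float, impacted_min: float, window_min: float) -> float:
    """Average failure rate over the full window, assuming `rate` during
    the impacted minutes and zero failures otherwise."""
    return rate * impacted_min / window_min

# End user traffic: ~3% for 20 min of the 276-min incident.
print(f"{whole_window_rate(0.03, 20, 276):.2%}")   # → 0.22%
# Admin traffic: ~3% for 136 + 74 = 210 min of the 276-min incident.
print(f"{whole_window_rate(0.03, 136 + 74, 276):.1%}")  # → 2.3%
```

Both results match the ~0.22% and ~2.3% figures in the table above.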
Here's also a visualization of the failure rate and length of impact:
Mitigation actions taken
There was one additional big difference that struck us after looking closely at the competitor's RCA. While we were actively preventing further impact, they do not appear to have taken any obvious mitigating action: all of their recoveries align with recoveries on the AWS side. Either their product design did not allow them to mitigate, or they lacked the expertise to do so.
Either way, this makes us believe that we are on the right track.
In this blog post, I have let you look under the hood of one of our reliability incidents, its resolution, and its aftermath.
We have also shown the value of a resilient architecture with no single points of failure, and of operational excellence, which combined to provide a substantially more reliable and available service than one of our main competitors when subject to exactly the same underlying infrastructure failures.
Although we are not fully satisfied with the result - our goal is no impact on our customers even under these circumstances - I truly believe that we are on the right track and already have a world class team and product!