In one of my previous blog posts, I described how we started our journey to achieve five nines of reliability (and why it is so critical for identity management systems like OneLogin), how we defined our metric, our first successes and failures, and where we ended up in 2020.
This next part is a continuation of that journey through October 2021, as well as a deeper dive into a few key steps on our path.
If you haven’t read it yet, I highly recommend reading the previous post first.
State at the end of 2020
The chart at the top summarizes where we stood at the end of 2020 and the contribution of the key project “Red-Zone” (a focused, engineering-wide effort covering all the necessary reliability work).
At the end of 2020, we reset our original goal of four nines to a new one:
“Reach four and a half nines (99.995%) by end of Q1 and five nines (99.999%) by end of 2021.”
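To make these targets concrete, here is a minimal sketch that converts a success-rate target into the number of failed requests it tolerates. The function name and the one-million-request volume are illustrative assumptions, not figures from our systems.

```python
from fractions import Fraction


def allowed_failures(target_percent: str, total_requests: int) -> int:
    """Return how many failed requests a success target tolerates.

    The target is passed as a string (e.g. "99.995") so it can be parsed
    exactly; binary floats cannot represent 99.995 precisely.
    """
    failure_fraction = 1 - Fraction(target_percent) / 100
    return int(failure_fraction * total_requests)


# Hypothetical volume of one million login requests:
for target in ("99.99", "99.995", "99.999"):
    print(target, allowed_failures(target, 1_000_000))
```

Per million requests, four nines tolerates 100 failures, four and a half nines tolerates 50, and five nines tolerates only 10.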
Once we had the End-User Login Success metric fully in place, we quickly realized that a big portion of the failures were “normal” failures spread across almost every service.
These were things like:
- Incomplete handling of incorrect request parameters (returning a “500 Internal Server Error” response instead of “400 Bad Request”)
- Incorrectly handled edge-case scenarios (often present in the product since its early days)
- Random network connectivity problems between services (returning “502 Bad Gateway” or “504 Gateway Timeout”)
- Requests that took too long (“504 Gateway Timeout”)
- Misaligned timeouts (e.g. a shorter timeout on the gateway than on the downstream service)
- No retry on dead keepalive connections, etc.
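Two of the items above can be sketched in a few lines: validating input up front so that a bad request yields a 400 rather than an unhandled 500, and checking that a gateway timeout is not shorter than the downstream one. The handler, its `username` parameter, and the function names are hypothetical and only illustrate the idea; they are not our actual code.

```python
from http import HTTPStatus


def handle_login(params: dict) -> tuple[int, str]:
    """Hypothetical handler: reject malformed input with 400 instead of crashing."""
    username = params.get("username")
    if not isinstance(username, str) or not username.strip():
        # Without this check, later code could raise and surface as a 500.
        return HTTPStatus.BAD_REQUEST, "username is required"
    return HTTPStatus.OK, f"hello {username.strip()}"


def misaligned_timeouts(gateway_timeout_s: float, downstream_timeout_s: float) -> bool:
    """True when the gateway would give up before the downstream service does."""
    return gateway_timeout_s < downstream_timeout_s
```

With the gateway at 30 seconds and a downstream service at 60 seconds, `misaligned_timeouts(30, 60)` returns `True`: the gateway answers with a timeout while the downstream service is still working on the request.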
We started to call these “normal” failures the “Leaky bucket.” The Leaky bucket cases not only added significantly to our overall failure rates but also hid more serious issues, such as regression bugs after new releases, small bursts of traffic that we did not handle well, or elevated failure rates.
Initially, we split all the failures into two main categories:
- Leaky bucket
- Incidents
Note on incidents: We use the term “incident” for any elevated error rate or any impact on certain use cases, regardless of its scale. We classify incidents by the number of failed requests as per our success metric and then put each of them in one of four buckets: major, medium, minor, or very minor.
Most incidents fall into the minor or very minor bucket and are usually not even noticed by our customers. We still track them and run a rigorous postmortem, as even the very minor ones have an impact on our high reliability goal.
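The bucketing described in the note can be sketched as a simple threshold lookup. The threshold values below are invented for illustration; the post does not state the real boundaries.

```python
import bisect

# Hypothetical thresholds on the number of failed requests (assumed values).
_BOUNDS = [100, 1_000, 10_000]
_LABELS = ["very minor", "minor", "medium", "major"]


def classify_incident(failed_requests: int) -> str:
    """Map a failed-request count to a severity bucket."""
    # bisect_right counts how many thresholds the value has reached.
    return _LABELS[bisect.bisect_right(_BOUNDS, failed_requests)]
```

Under these assumed thresholds, an incident with 50 failed requests is "very minor" and one with 25,000 is "major".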
We created an aggressive plan to get rid of the Leaky bucket cases - they were easier to address and fixing them would provide us with much better visibility into the other issues.
Starting in September 2020, we set a goal to decrease Leaky bucket cases by 30% each month. The process was:
1. Each team identifies the highest Leaky bucket offender(s) that need relatively little work to fix (low-hanging fruit first)
2. Release the fix
3. Review the data, go to 1
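A 30% monthly reduction compounds quickly. The following sketch computes the month-by-month targets from a starting failure count; the starting value of 1,000 is an arbitrary example, not our real data.

```python
def monthly_targets(start_failures: float, months: int, reduction: float = 0.30) -> list[float]:
    """Failure-count target at the end of each month under a fixed monthly reduction."""
    return [start_failures * (1 - reduction) ** m for m in range(1, months + 1)]


# Example: starting from a hypothetical 1,000 Leaky bucket failures per month.
for month, target in enumerate(monthly_targets(1000, 6), start=1):
    print(month, round(target))
```

After six months of hitting the goal, the remaining failures are 0.7^6 of the starting value, or roughly 12%.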
As mentioned in the first part of this series, we introduced Error Budgets early in 2020. Error Budgets have proven to be a very efficient tool to motivate teams and drive down failures, especially the Leaky bucket ones.
As the service owners improved the overall failure rates, the Site Reliability Engineering team kept raising the target for each service so that the service’s results hovered around the next target. This kept teams motivated to improve by reaching the next iterative goal, without demotivating them with the ultimate goal (at that time, five nines was seen as more of a dream than a reality).
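The idea of an error budget combined with a rising target can be sketched as follows. This is a simplified illustration: the target ladder and function names are assumptions, and our actual tooling is not shown in this post.

```python
# Hypothetical ladder of per-service targets, lowest first (assumed values).
TARGET_LADDER = [0.999, 0.9995, 0.9999, 0.99995]


def error_budget_remaining(target: float, total: int, failed: int) -> float:
    """Fraction of the period's error budget still unspent (negative if overspent).

    Assumes target < 1 and total > 0, so the allowed failure count is positive.
    """
    allowed = (1 - target) * total
    return 1 - failed / allowed


def next_target(achieved: float, ladder: list[float] = TARGET_LADDER) -> float:
    """Pick the first target above the achieved success rate; stay at the top otherwise."""
    for target in ladder:
        if target > achieved:
            return target
    return ladder[-1]
```

For example, a service that achieved 99.93% would be given 99.95% as its next goal, which is close enough to feel reachable.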
As the noise in the data decreased, we kept improving the Reliability and Error Budgets dashboards to get a more fine-grained understanding of the data from each underlying service, and we made both the dashboards and the alerts faster and more sensitive.
Eventually we ended up with two dashboards (supported by real-time alerts that detected anomalies). The first dashboard tracks progress over a longer time and is also used for regular monthly assessments.
We call it the “SLO Dashboard.” Below is a recent example filtered for the month of October in our EU shard:
It measures our global SLO number and details each of the major services against our current goal (99.995%). It also shows a comparison of the failure rates between services - this was instrumental in identifying the next offending target in our “Leaky bucket war.”
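The per-service comparison behind such a dashboard can be sketched as a small report that ranks services by success rate against the goal. The service names and counts are made up, and the calculation shown is an assumption about how one might compute it rather than a description of our dashboard.

```python
def slo_report(counts: dict[str, tuple[int, int]], goal: float = 0.99995) -> list[tuple[str, float, bool]]:
    """Rank services worst-first by success rate.

    counts maps a service name to (total_requests, failed_requests).
    Each row is (service, success_rate, meets_goal).
    """
    rows = []
    for service, (total, failed) in counts.items():
        rate = 1 - failed / total
        rows.append((service, rate, rate >= goal))
    rows.sort(key=lambda row: row[1])  # the worst offender comes first
    return rows


# Hypothetical monthly counts for two services.
example = {"auth": (1_000_000, 20), "portal": (2_000_000, 300)}
for service, rate, ok in slo_report(example):
    print(f"{service}: {rate:.5%} {'OK' if ok else 'below goal'}")
```

The first row of the report is the service to look at next.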
The second dashboard was introduced after most of the Leaky bucket issues were cleared. During a reliability incident, we wanted a quick overview of the impact to our end users as well as a detailed analysis of what was wrong. We needed many more details, especially with a focus on much smaller windows (usually minutes to hours).
So we made the “Investigation Dashboard.” It provides additional details like:
- Top failing endpoints
- Grid of status codes per service
- Timelines of failures per status code, service, region, and endpoint
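The first two views in this list boil down to simple aggregations over request records. The sketch below assumes each record is a dictionary with `service`, `endpoint`, and `status` keys and treats any 5xx status as a failure; both are illustrative assumptions, not our actual schema or metric definition.

```python
from collections import Counter


def top_failing_endpoints(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n endpoints with the most failed (5xx) requests."""
    failures = Counter(r["endpoint"] for r in records if r["status"] >= 500)
    return failures.most_common(n)


def status_grid(records: list[dict]) -> dict[tuple[str, int], int]:
    """Count requests per (service, status code) pair."""
    return dict(Counter((r["service"], r["status"]) for r in records))
```

Restricting `records` to a window of a few minutes before calling these functions gives the kind of narrow view that is useful during an incident.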
Plus, it can be further filtered down by various criteria. Below is an example of the Investigation dashboard:
As a reminder, we ended 2020 with a December result of 99.993%. So how have we done in 2021?
Our push to reduce the Leaky bucket got us close to 99.995% for the first time in March 2021. In April, an incident set us back, and then in May we suffered an even more significant incident.
At this point, we realized that fixing even more Leaky bucket cases would no longer provide the most significant results, and we fully refocused our efforts on reducing and mitigating incidents.
In June we celebrated a huge victory of reliability well over our monthly goal. This was in part a result of the refocus but, honestly, there was also a bit of luck.
“Luck is what happens when preparation meets opportunity.”
We had other incidents in July and August, but as you can see, these were significantly smaller - and going forward we were further able to reduce their occurrences as well as their impact.
To Five Nines or Not To Five Nines?
This process and feedback from our customers made us realize a few key things:
Five nines of reliability is not necessary for a service such as OneLogin. Even our largest and/or most sensitive customers did not notice any impact until service levels dropped below four nines.
To achieve four nines consistently, we need to target four and a half nines or better - basically, 99.995% should be our comfort level, which we can maintain even with small to medium-scale issues.
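The reasoning behind this margin can be shown with a little arithmetic: the gap between the comfort level and the four-nines floor is the headroom available to absorb incidents. The function name and the request volume are illustrative assumptions.

```python
from fractions import Fraction


def incident_headroom(floor_percent: str, comfort_percent: str, total_requests: int) -> int:
    """Extra failed requests that fit between the comfort level and the floor."""
    floor_failures = 1 - Fraction(floor_percent) / 100
    comfort_failures = 1 - Fraction(comfort_percent) / 100
    return int((floor_failures - comfort_failures) * total_requests)


# Per hypothetical million requests, running at 99.995% leaves room for
# 50 additional failures before dropping below 99.99%.
print(incident_headroom("99.99", "99.995", 1_000_000))
```

In other words, operating at 99.995% uses only half of the four-nines failure allowance, and the other half is the buffer for incidents.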
That is why - even though we liked the very ambitious goal of five nines - we changed our goal to “Consistently and reliably achieve 99.995% of reliability” and reshuffled our investments to make sure we won’t suffer from incidents like the one in May 2021.
Summary and Learnings
To fully appreciate the reliability improvements, let’s put the year 2020 in the same chart as 2021:
The ultimate goal may look unattainable at the beginning, but it can always be split into small, iterative steps. Do not get frustrated by the big goal. Instead, focus on the achievable smaller goals, and the ultimate goal suddenly starts to feel real.
Defining the metric was the foundation and a turning point. Our unified goal drove iterative process improvements like the Error Budgets tooling, the CAPA process, the Red-Zone initiative, and our Hydra Infrastructure redesign. Together, these allowed our amazing team to have a dramatic impact, resulting in world-class reliability.
Achieving each additional nine of reliability means that the system must be 10 times more resilient, reliable, and generally better, because the allowed failure rate shrinks tenfold. This needs significant investment on a similar scale - if done well, not necessarily 10x, but certainly some multiplier.
The final learning on our path is to recognize the right level of reliability for your service and to be flexible enough to redefine your goal, instead of blindly chasing some high number that provides no significant additional benefit for customers.
Our goal is to consistently achieve 99.995%, which essentially means zero impact to our end users. What’s your goal?