Designing OneLogin for Scale and Reliability - Part 2: Anatomy of a Cluster

December 1st, 2021 | product & technology

The first part of the “Designing OneLogin for Scale and Reliability” series discussed our design principles, key insights and infrastructure architecture focused on Login Clusters.

In this part, we are going to look deeper into the details of a cluster and how it is designed in OneLogin.

Clusters

A “cluster” is a bundled set of stateless (micro-)services with dedicated functionality. Currently, we have two types of clusters:

  • Login clusters - processing end user flows
  • Admin clusters - for admin tasks and job processing


Both types of clusters share the same design, so for the sake of this post we won’t differentiate between Login and Admin clusters and will simply call them both clusters.

A cluster (in the OneLogin Hydra Infrastructure Architecture) is a standalone Kubernetes (AWS Elastic Kubernetes Service) cluster. Each cluster is separated into Kubernetes Namespaces.

Cluster Namespaces

Namespaces can be dedicated to supporting components - e.g. the kube-system, monitoring or flux namespaces - or contain a full deployment of the OneLogin product. We refer to the second group as “product namespaces.” Product namespaces are fully isolated, sharing only underlying cluster-wide supporting services.

We have several use cases for product namespaces:

  1. Team namespaces - In our non-production (nonprod) environment we use product namespaces to create dedicated team (or personal) full deployments of the product while sharing the same physical EKS cluster and underlying platform support components (like monitoring). We use team namespaces as separate team-level playgrounds and integration workspaces for in-development features, improvements and bug fixes.

    Figure - Team namespaces

  2. Release Channel namespaces - To support our channel rollout strategy for new versions, we plan to split the production system into more independent environments with defined release lifecycles for each channel (typically, the channels provide different combinations of velocity and risk). We plan to use three release channels: “fast”, “stable” and “enterprise.” Currently we are only using the “stable” namespace and plan to introduce the other release channels next year.

    Figure - Release channels and validation deployed as isolated namespaces

  3. Quality Engineering environment - In all environments we have a dedicated “qe” namespace that mirrors the deployment in the stable namespace channel. Our Quality Engineering team uses this namespace for final validation testing of releases.


The product namespaces typically share the same underlying physical resources (database, message queue), but use logical separation for each namespace - e.g. databases with templated names such as “events_qe” for the “qe” namespace.
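The templated naming described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the helper name `namespaced_name` is hypothetical, and the rule of mapping hyphens in a namespace name to underscores is an assumption made only for this example. The one pairing taken from the text is “events” plus “qe” giving “events_qe”.

```python
import re

# Kubernetes namespace names are DNS labels: lowercase alphanumerics and "-".
_NAMESPACE_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
# Assumption: base resource names are simple lowercase identifiers.
_BASE_RE = re.compile(r"^[a-z][a-z0-9_]*$")


def namespaced_name(base: str, namespace: str) -> str:
    """Build the logical resource name used by one product namespace.

    Hypothetical helper: joins the base name and the namespace with "_",
    replacing "-" in the namespace so the result is a plain identifier.
    """
    if not _BASE_RE.match(base):
        raise ValueError(f"unsupported base name: {base!r}")
    if not _NAMESPACE_RE.match(namespace):
        raise ValueError(f"unsupported namespace: {namespace!r}")
    return f"{base}_{namespace.replace('-', '_')}"


if __name__ == "__main__":
    print(namespaced_name("events", "qe"))
```

Deriving names in one place keeps every namespace pointed at its own logical database even though the physical server is shared.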

Key System Components

Each cluster contains a set of services and components that support non-functional aspects like operation, monitoring, maintenance, release, etc.

Below is a brief outline of some of the key system components in our clusters:

  • Flux - Flux is a GitOps operator that syncs all configuration and Kubernetes manifests from Git to the cluster. This is the only way we allow any changes or deployments in the cluster (no manual changes, aka click-ops).


Figure - A GitOps workflow for Elastic Kubernetes Service
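To make the GitOps idea concrete, here is a simplified sketch of one reconciliation step: compare the desired state from Git with the state observed in the cluster and work out what to change. This is not how Flux is implemented internally. The function name `plan_sync`, the string "specs" and the choice to delete anything missing from Git are all assumptions made for illustration.

```python
from typing import Dict, List


def plan_sync(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Plan one reconciliation pass.

    desired: resource name -> spec as declared in Git
    actual:  resource name -> spec currently observed in the cluster
    """
    # Apply anything that is missing from the cluster or differs from Git.
    to_apply = sorted(
        name for name, spec in desired.items() if actual.get(name) != spec
    )
    # Assumption: resources that are no longer declared in Git are removed.
    to_delete = sorted(name for name in actual if name not in desired)
    return {"apply": to_apply, "delete": to_delete}


if __name__ == "__main__":
    git_state = {"deploy/api": "v2", "deploy/worker": "v1"}
    cluster_state = {"deploy/api": "v1", "deploy/old": "v1"}
    print(plan_sync(git_state, cluster_state))
```

Because the plan is computed only from Git and the observed state, running it again after a successful sync yields nothing to do, which is what makes Git the single source of truth.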

  • Services autoscaler - each service registers thresholds and algorithms for its autoscaling (via Kubernetes Horizontal Pod Autoscaler manifests). Services autoscaler is responsible for triggering an increase/decrease of service replicas based on traffic and autoscaling policies. We have custom metrics extensions to our telemetry systems (primarily Datadog).
  • Cluster autoscaler - the cluster autoscaler is responsible for keeping sufficient underlying compute capacity (AWS EC2 instances) available, based on the configured cluster autoscaler policy. After the services autoscaler adds or removes service replicas, the cluster autoscaler evaluates whether underlying node resource usage is within the configured policy (typically, we target 60-70% resource utilization, leaving headroom to scale quickly in response to spikes in traffic) and adds or removes cluster compute capacity (EC2 instances) as needed.
  • Monitoring and logging - components and services of the monitoring and logging infrastructure (we primarily use Datadog, Sumo Logic and New Relic).
  • External Secrets controller - we use AWS Secrets Manager to manage and securely store service secrets. External secrets controller is responsible for securely synchronizing the secrets from AWS Secrets Manager to Kubernetes Secrets and provides secrets to running services.


Figure - Storing and synchronizing secrets

  • Ingress - traffic ingress proxy, which ingests and authenticates all incoming requests. We use the Traefik proxy with some custom middlewares.
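The interplay between the services autoscaler and the cluster autoscaler described above can be sketched with two small calculations. The replica formula is the one documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)), shown here without its tolerance and stabilization details. The node calculation is a simplified assumption of ours and not the cluster autoscaler's real algorithm. The default of 65% utilization and the maximum of 50 replicas are illustrative values only.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 3,   # the post mentions a minimum of three replicas
    max_replicas: int = 50,  # hypothetical upper bound
) -> int:
    """Replica count from the documented HPA ratio, clamped to the bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


def nodes_needed(
    requested_cpu: float,
    cpu_per_node: float,
    target_utilization: float = 0.65,  # assumed midpoint of the 60-70% range
) -> int:
    """Simplified estimate of nodes needed to stay at the target utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(requested_cpu / (cpu_per_node * target_utilization))


if __name__ == "__main__":
    replicas = desired_replicas(current_replicas=4, current_metric=90, target_metric=60)
    print(replicas, nodes_needed(requested_cpu=replicas * 2.0, cpu_per_node=8))
```

Keeping utilization below 100% means new replicas can usually be scheduled immediately on existing nodes, while the slower step of adding EC2 instances happens in the background.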


Nodegroups

Services have different needs for their underlying (EC2) compute instances - in terms of resources (CPU, memory) and reliability. The specific groups of compute resources in a cluster are called nodegroups (as each EC2 instance is mapped to a Kubernetes node).

We have three main nodegroups in each cluster:

  • Base nodegroup - dedicated to the underlying key supporting components and to ingress proxies. This is our most critical group and requires the highest level of reliability. We typically keep resource utilization low, scaling up aggressively and scaling down conservatively. This group is optimized for networking.
  • Service nodegroup - this group runs services that are involved in synchronous end-user and administrator flows. This is our “standard” nodegroup where most of the request/response services run. This group is optimized for both CPU and networking.
  • Compute nodegroup - this group is dedicated to asynchronous job processing (like running asynchronous directory mappings), which is often more compute intensive. This group is optimized for CPU. As these jobs can be retried without any impact on users, we typically leverage AWS Spot instances to reduce costs. This choice - without additional effort - allows us to continuously practice Chaos Engineering, making sure that our jobs always tolerate failure of underlying nodes.


Each nodegroup is autoscaled independently. Nodegroups also serve as an additional level of resource and security isolation of components with different needs.
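One way to picture the mapping of workloads to nodegroups is as a small lookup that yields a node selector for each kind of workload. The sketch below is illustrative only: the label key `nodegroup`, the workload kind names and the `spot_instances` values for the base and service groups are assumptions (the text above only states that the compute group typically uses Spot instances).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Nodegroup:
    name: str
    optimized_for: Tuple[str, ...]
    spot_instances: bool  # assumed False unless the text mentions Spot


NODEGROUPS: Dict[str, Nodegroup] = {
    "base": Nodegroup("base", ("networking",), spot_instances=False),
    "service": Nodegroup("service", ("cpu", "networking"), spot_instances=False),
    "compute": Nodegroup("compute", ("cpu",), spot_instances=True),
}

# Hypothetical workload kinds mapped to the nodegroup that should run them.
WORKLOAD_TO_NODEGROUP: Dict[str, str] = {
    "platform_component": "base",
    "ingress_proxy": "base",
    "request_response": "service",
    "async_job": "compute",
}


def node_selector(workload_kind: str) -> Dict[str, str]:
    """Return a Kubernetes-style nodeSelector for the given workload kind."""
    try:
        group = NODEGROUPS[WORKLOAD_TO_NODEGROUP[workload_kind]]
    except KeyError:
        raise ValueError(f"unknown workload kind: {workload_kind!r}") from None
    return {"nodegroup": group.name}  # "nodegroup" is a hypothetical label key


if __name__ == "__main__":
    print(node_selector("async_job"))
```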

Figure - Cluster nodegroups

Services

In each cluster, stateless (micro-)services are deployed as Kubernetes deployments (some supporting services are deployed as daemonsets).

Each service is run in multiple replicas and autoscaled depending on actual traffic, with a minimum of three replicas per service in a cluster. Replicas are equally spread between three availability zones.
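The even spread of replicas can be illustrated with a short calculation. This is a sketch of the arithmetic only; the zone names are placeholders, and the function `spread_replicas` is hypothetical. In Kubernetes, this kind of placement is commonly expressed with topology spread constraints, though this post does not go into which mechanism is used.

```python
from typing import Dict, Sequence

MIN_REPLICAS = 3  # minimum replicas per service in a cluster, per the text


def spread_replicas(
    replicas: int,
    zones: Sequence[str] = ("az-a", "az-b", "az-c"),  # placeholder zone names
) -> Dict[str, int]:
    """Distribute replicas across zones so counts differ by at most one."""
    if replicas < MIN_REPLICAS:
        raise ValueError(f"at least {MIN_REPLICAS} replicas are required")
    base, extra = divmod(replicas, len(zones))
    # The first `extra` zones each take one additional replica.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


if __name__ == "__main__":
    print(spread_replicas(7))
```

With at least one replica per zone, losing a single availability zone still leaves the service running in the other two.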

Clusters are connected to their own regional queues and stateful services (e.g. databases). The connection to these stateful services is managed by network load balancers that spread traffic equally between availability zones - i.e. a service in Availability Zone A talks to all of the DBs in Availability Zones A, B and C, so even if any one of the database replicas fails, the whole cluster continues to work without interruption. We regularly test such scenarios in our live production clusters.

Figure - One cluster with services and data layer spread across availability zones
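The failure tolerance described above can be illustrated with a toy round-robin balancer that skips unhealthy backends. This is a conceptual sketch, not a model of how AWS network load balancers work. The class `ZoneBalancer` and the backend names are hypothetical.

```python
from typing import Iterable, List, Set


class ZoneBalancer:
    """Round-robin over backends in all zones, skipping ones marked down."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._down: Set[str] = set()
        self._cursor = 0

    def mark_down(self, backend: str) -> None:
        self._down.add(backend)

    def mark_up(self, backend: str) -> None:
        self._down.discard(backend)

    def pick(self) -> str:
        # Try each backend at most once, starting from the cursor position.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._backends)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    balancer = ZoneBalancer(["db-az-a", "db-az-b", "db-az-c"])
    balancer.mark_down("db-az-b")  # simulate one database replica failing
    print([balancer.pick() for _ in range(4)])
```

Because every service replica can reach backends in all three zones, one failed replica only reduces capacity instead of cutting off the services in its zone.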

Clusters are designed to be independent and isolated - there is no direct connection (inter-cluster call) allowed between clusters. All operations are designed so that one cluster never needs to directly reach any other cluster.

Figure - No direct cross-cluster communication

We allow only indirect communication and only from the login to the admin cluster in order to store data in persistent storage (as described in Part 1).

Figure - Indirect login to admin cluster communication

The Twelve-Factor App

Our microservices follow the Twelve-Factor App methodology, which was created, pioneered and battle-tested by the teams behind the Heroku Cloud Application Platform.

The Twelve-Factor App is a methodology for building services that are suited for deployment on modern cloud platforms, enabling fast iterative development, flexible deployments, easy scaling, and engineering and operational excellence.
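As one small illustration of the methodology, the third factor (“Config”) asks that configuration be stored in the environment rather than in code. A minimal sketch of what that can look like in a service follows. The variable names `PORT`, `DATABASE_URL` and `LOG_LEVEL` and their defaults are hypothetical, not taken from our services.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceConfig:
    port: int
    database_url: str
    log_level: str


def load_config(env: Optional[Mapping[str, str]] = None) -> ServiceConfig:
    """Read all configuration from environment variables.

    Passing `env` explicitly makes the loader easy to test; by default
    it reads from the real process environment.
    """
    env = os.environ if env is None else env
    if "DATABASE_URL" not in env:
        # Fail fast at startup instead of falling back to a hard-coded value.
        raise RuntimeError("DATABASE_URL must be set")
    return ServiceConfig(
        port=int(env.get("PORT", "8080")),
        database_url=env["DATABASE_URL"],
        log_level=env.get("LOG_LEVEL", "info"),
    )


if __name__ == "__main__":
    print(load_config({"DATABASE_URL": "postgres://example.invalid/events_qe"}))
```

The same container image can then run unchanged in every namespace and environment, with only the injected environment differing.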

Services Platform

We refer to the Services Platform as the combination of container images, libraries, configuration, underlying platform and complementary services that are provided to our engineering teams. The Services Platform enables them to:

  • Focus on business logic and not waste time with infrastructure or platform tasks.
  • Rely on a unified service operating system and runtime environment.
  • Solve the difficult problems of operating scalable products only once.
  • Plug in to CI/CD and the underlying operational platform.


The main components of the OneLogin Services Platform are:

  • Base (Docker) Images - unifies services OS and language-specific stack. We support Node.js and Ruby, with Node.js being our primary stack.


Figure - Example of base image

  • Service Base - language-specific base server, base worker and libraries.
  • Container Platform and Platform Services - the orchestration service (EKS), network mesh as well as complementary services that enable common functionalities like telemetry, logging.


Figure - Services Platform architecture

In the picture above, the green and red layers are provided by our OneLogin Platform team, and the blue pieces are controlled by our “feature teams,” which own specific services and develop their business logic.

Summary

In this blog post, we covered some principles we used when designing our clusters - the essential building blocks of our OneLogin Hydra architecture.

We covered some key components and how we leverage namespaces and nodegroups to achieve high quality, reliability and ease of development.

We also mentioned engineering and operational principles used in our services and how the feature and platform teams separate their concerns to collaboratively build OneLogin’s world class product!

About the Author

As VP of Platform Engineering at OneLogin, Tomas loves to design, build and deliver easy to use and scalable cloud products. At OneLogin he has held multiple roles in engineering; most recently he has built up and now leads the Platform Engineering organization. He is architect of the Hydra Platform - a modern, highly reliable and scalable IDaaS platform. Prior to OneLogin, he co-founded and was CTO of Portadi - a startup later acquired by OneLogin.

View all posts by Tomas Soukup
