What is High Availability Cloud Computing & How Do I Achieve It?

What is high availability in cloud computing? High availability in the cloud describes a computing infrastructure designed to keep a system functioning even when certain components fail.

What Is “High Availability”?

In an always-on digital world, availability is king. This reality doesn’t apply only to retail or consumer-facing applications. Researchers and engineers have also turned to high-performance and always-available cloud computing to power their projects in areas like genomic sequencing, machine learning, and manufacturing and supply chain management.

Saying a cloud infrastructure is “highly available” actually means something more specific than just “usually available whenever we need it.” Instead, availability is typically rated by the number of “nines” in its uptime percentage, measured over a day, week, or year:

  • 99% uptime = two nines
  • 99.9% uptime = three nines
  • 99.99% uptime = four nines

And so on. To frame these percentages, an HA cloud system with three nines would have a mean downtime of roughly 1.44 minutes per day, about 43 minutes per month, and roughly 8.76 hours per year.
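The downtime implied by a given number of nines can be checked with a short calculation. The sketch below is illustrative only; the function name `downtime_minutes` and the 30-day month and 365-day year are assumptions made for this example.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Allowed downtime, in minutes, over a period at a given uptime percentage."""
    unavailable_fraction = 1 - uptime_percent / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


for uptime in (99.0, 99.9, 99.99):
    per_day = downtime_minutes(uptime, 1)          # minutes per day
    per_month = downtime_minutes(uptime, 30)       # minutes per 30-day month
    per_year = downtime_minutes(uptime, 365) / 60  # hours per 365-day year
    print(f"{uptime}% uptime: {per_day:.2f} min/day, "
          f"{per_month:.1f} min/month, {per_year:.2f} h/year")
```

Running the loop makes the gap between levels concrete: each extra nine divides every downtime figure by ten.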

Other important metrics for measuring high availability are Mean Time Between Failures (MTBF) and Recovery Time Objective (RTO). The former is the expected time between system failures, while the latter is the target maximum time for restoring service after a planned or unplanned outage.
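One common way to connect MTBF to an uptime percentage is the steady-state formula availability = MTBF / (MTBF + MTTR). Mean Time To Repair (MTTR) is not discussed in this article and is introduced here only as an assumption for the example, as are the function name and the sample figures.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time between failures
    and mean time to repair (MTTR is an assumption of this sketch)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: a failure every 1,000 hours, one hour to repair.
availability = steady_state_availability(1000.0, 1.0)
print(f"{availability * 100:.3f}% uptime")
```

Under this model, availability improves either by failing less often (higher MTBF) or by recovering faster (lower repair time), which is why recovery targets matter as much as failure rates.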

Additionally, each move from one level to the next is an order-of-magnitude change: every added nine cuts the allowed downtime by a factor of ten. While promising three nines rather than two might seem like a slight improvement, it’s a massive step up in uptime.

The core principles of high availability cloud that provide these extreme levels of uptime are:

  • Eliminating Single Points of Failure: HA cloud systems reduce single points of failure through system redundancy, with the goal that no single system can fail and cause a loss of availability.
  • Redundancy and Failover: Because redundancy systems can also fail, a critical part of the HA cloud is implementing robust and reliable redundancy that does not have a single point of failure.
  • Orchestration: HA cloud systems must seamlessly and automatically route network traffic through storage and processing clusters such that performance is well maintained and, if there is a failure, users and applications experience little or no downtime.
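The effect of redundancy on uptime can be estimated with a simple model. The sketch below assumes that replicas fail independently and that failover between them is instantaneous and perfect; both are simplifying assumptions, and the function name is hypothetical.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` identical, independent components in parallel.

    The group is unavailable only when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return 1 - (1 - component_availability) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each: {combined_availability(0.99, n) * 100:.4f}%")
```

Real systems rarely achieve full independence, since replicas often share power, networking, or software bugs, so these numbers are best read as an upper bound.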

What Are the Components of a High-Availability Cloud?

Cloud systems aren’t inherently high availability. They have to be built that way with the above principles in mind.

Accordingly, several mission-critical components play a role in providing enterprise users and researchers with what can properly be called “high-availability” cloud systems:

  • Clusters: Cloud infrastructure is built on shared computing resources, organized as nodes, that are presented to users and applications as a single, unbroken computational environment. A cluster is a grouping of nodes that serve a logical purpose. For example, one cluster of computation nodes will serve user-facing applications to the public Internet, while another may provide high-availability storage. Using clusters can help maintain failover and redundancy standards through different organizations (see more below).
  • Backup and Recovery: Systems fail. It’s unavoidable. What differentiates an HA cloud system from traditional cloud systems is that should a cluster fail, the HA architecture acts to recover from the failure. This includes placing other identical nodes online and routing network traffic to avoid application downtime or storing system backups in fast archival storage that can be used for immediate disaster recovery.
  • Failure Detection: To support rapid backup and recovery in mission-critical HA systems, an automated and in-depth failure detection and response system should be in place. These systems will alert administrators to issues related to system health and, in most cases, automatically conduct recovery or load-balancing operations to avoid downtime. These automated operations can include re-routing traffic between clusters, putting new clusters online, and taking failed clusters offline for maintenance.
  • Load Balancing: Regardless of whether a failure has occurred, high-availability cloud systems must maintain appropriate performance at all times. That means using load-balancing operations to ensure that no single cluster is overwhelmed with work or traffic.
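Failure detection and load balancing work together: the balancer only sends traffic to clusters that the health checks consider up. A minimal sketch of that idea follows, using round-robin selection. The `Cluster` and `RoundRobinBalancer` names are hypothetical, and in a real system the `healthy` flag would be set by an automated health-check process rather than by hand.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Cluster:
    name: str
    healthy: bool = True  # assumed to be updated by a separate health checker


class RoundRobinBalancer:
    """Rotates requests across the clusters currently marked healthy."""

    def __init__(self, clusters):
        self.clusters = list(clusters)
        self._counter = count()

    def pick(self) -> Cluster:
        healthy = [c for c in self.clusters if c.healthy]
        if not healthy:
            raise RuntimeError("no healthy clusters available")
        return healthy[next(self._counter) % len(healthy)]


clusters = [Cluster("a"), Cluster("b"), Cluster("c")]
balancer = RoundRobinBalancer(clusters)
print([balancer.pick().name for _ in range(3)])

clusters[1].healthy = False  # simulate failure detection taking "b" offline
print([balancer.pick().name for _ in range(4)])
```

Production load balancers typically weigh current load and latency as well, so plain round-robin is only the simplest possible policy.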

Cluster Formations

High-availability clusters come in several different types of organizations that help them maintain the uptime of the application or platform. However, two major categories of cluster configuration fit within the HA paradigm:

  • Active/Active: This organization represents an approach where two or more clusters making up an HA cloud system are always on and running the same service (application, storage, etc.) simultaneously. The cloud load balancer will split traffic among them to maintain performance across all of them. This approach helps HA cloud systems scale by ensuring that there are always enough resources to support user demand.
  • Active/Passive: Conversely, if a cloud system has two or more clusters, some of which are serving traffic and some of which are on standby, then it’s called an “active/passive” structure. The passive cluster serves as a failover: should the active cluster go offline or fail, the passive cluster immediately comes online to take over. Of course, this setup requires the passive cluster to be quickly loaded and operational, calling for fast and scalable technology.

These two formations aren’t mutually exclusive. An HA cloud system can involve several active clusters operating simultaneously, serving the same resources and balancing demand, with passive nodes waiting on standby.
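The active/passive handoff can be sketched as a small state machine that promotes the standby after several consecutive failed health checks. The class name, the cluster names, and the threshold of three failures are all assumptions chosen for illustration.

```python
class ActivePassivePair:
    """Tracks which of two clusters serves traffic and fails over on repeated errors."""

    def __init__(self, active: str, passive: str, failure_threshold: int = 3):
        self.active = active
        self.passive = passive
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def report_health_check(self, ok: bool) -> str:
        """Record one health-check result for the active cluster.

        Returns the name of the cluster that should serve traffic afterwards.
        """
        if ok:
            self._consecutive_failures = 0
            return self.active
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # Promote the standby; the failed cluster becomes the new standby.
            self.active, self.passive = self.passive, self.active
            self._consecutive_failures = 0
        return self.active


pair = ActivePassivePair("primary", "standby")
for result in (True, False, False, False):
    print(pair.report_health_check(result))
```

Requiring several consecutive failures before switching is a common way to avoid failing over on a single transient error, at the cost of a slightly slower reaction.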

How Do Containers Relate to HA Clusters?

Containers are virtualized applications and components that promote extreme portability and flexibility for developers on the cloud. These apps can prove highly modular, allowing developers to compartmentalize full applications or even microservices that power multiple related apps.

Container flexibility allows data scientists and developers to run apps and services across different cloud architectures, including hybrid and multi-cloud environments. It also allows different containerized components to run across multiple clusters to power a single app.

Thus, containers can leverage HA cloud architecture and clusters to drastically increase app performance and availability. Furthermore, by strategically distributing app components across clusters, administrators can better maintain, repair, and update those components without disrupting the entire system–a critical feature for HA cloud systems.
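As a rough illustration of distributing components across clusters, the sketch below places replicas of a containerized component on clusters in rotation, so that losing or servicing one cluster leaves replicas running elsewhere. The function name, component name, and cluster names are hypothetical; real orchestrators use far richer placement rules.

```python
def spread_replicas(component: str, replicas: int, clusters: list[str]) -> dict[str, list[str]]:
    """Assign replica names to clusters in rotation, one cluster at a time."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    placement: dict[str, list[str]] = {cluster: [] for cluster in clusters}
    for index in range(replicas):
        target = clusters[index % len(clusters)]
        placement[target].append(f"{component}-{index}")
    return placement


placement = spread_replicas("api", 4, ["cluster-a", "cluster-b", "cluster-c"])
for cluster, containers in placement.items():
    print(cluster, containers)
```

With this placement, taking any single cluster offline for maintenance still leaves at least two replicas of the component serving traffic.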

Build Your High-Availability Cloud on WEKA

High-availability cloud is a resource-intensive infrastructure but one that’s nonetheless a necessity for high-demand computational processes and always-on applications. The right setup includes HPC cloud architecture and a solution that can support orchestration, load balancing, and cross-cluster computing.

With WEKA, you get the following features:

  • High-performance stateful data storage that simplifies the process of moving containerized workloads to the cloud or sharing data across multiple clusters
  • Industry-best GPUDirect performance (113 Gbps for a single DGX-2 and 162 Gbps for a single DGX A100)
  • In-flight and at-rest encryption for governance, risk, and compliance requirements
  • Agile access and management for edge, core, and cloud development
  • Scalability up to exabytes of storage across billions of files

Contact our experts today to learn how WEKA can be the bedrock of your high-availability cloud.