Feb 27, 2017

AWS Availability: How to Achieve Fault Tolerance and Redundancy in EC2

Paul Painter, Director, Solutions Engineering

Cloud | Data Centers

With millions of customers and a host of entry-level services that help companies move onto the cloud quickly, it’s no surprise that organizations are shifting mission-critical apps and services onto the Amazon Elastic Compute Cloud (EC2). According to AWS (Amazon Web Services), “Amazon Elastic Compute Cloud (Amazon EC2) provides computing resources, literally server instances, that you use to build and host your software systems.

Amazon EC2 is a natural entry point to AWS for your application development. You can build a highly reliable and fault-tolerant system using multiple EC2 instances and ancillary services such as Auto Scaling and Elastic Load Balancing.”

Fault Tolerance & Redundancy: The Same, But Different

First up? Defining the difference between redundant and fault-tolerant solutions. While the terms are certainly related — and often used interchangeably — they’re not exactly the same. And although there’s no hard-and-fast rule regarding the definitions, the commonly accepted answer goes like this:

Components — such as disks, racks or servers — are redundant.
Systems — such as disk arrays or cloud computing networks — are fault tolerant.

Put simply, redundant means having more than one of something in case the first instance fails. Having two disks on the same system that are regularly backed up makes them redundant, since if one fails the other can pick up the slack. If the entire system fails, however, both disks are useless. This is the role of fault tolerance: to keep the system as a whole operating even if portions of the system fail.

According to AWS, “Fault-tolerance is the ability for a system to remain in operation even if some of the components used to build the system fail.” The AWS platform enables you to build fault-tolerant systems that operate with a minimal amount of human interaction and up-front financial investment.

So, how does this apply to EC2 and the Amazon cloud?

Saving Grace

For many companies, the cloud acts as both home for applications and a flexible DR service in the event of local systems failure. But what happens when the cloud itself goes down? Like all cloud providers, Amazon has experienced outages due to weather, power failures and other disasters. While the company promises 99.95 percent uptime for its compute instances, that still allows roughly 4.4 hours of downtime per year.
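The downtime figure follows directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not part of any AWS tool):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the hours per year a service may be down at a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Compare a few common uptime targets, including the 99.95 percent cited above.
    for uptime in (99.0, 99.9, 99.95, 99.99):
        hours = annual_downtime_hours(uptime)
        print(f"{uptime}% uptime allows {hours:.2f} hours of downtime per year")
```

At 99.95 percent, this works out to 8,760 × 0.0005 = 4.38 hours.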

Using Amazon as a DR solution is now both possible and recommended, but it isn’t perfect. To address this, EC2 comes with several tools that help companies increase both their total redundancy and overall fault tolerance. These include availability zones, elastic IP addresses and snapshots, all of which a fault-tolerant, highly available system must use correctly.

Ramping Up AWS Redundancy

How do companies address the issue of redundancy in their EC2 instances? It starts with availability zones (AZs). These zones are divided by region, meaning if you’re on the West Coast of the United States you’ll have a choice of multiple zones along the coast that are independently powered and cooled, and have their own network and security architectures.

AZs are insulated from the failures of other zones in the group, making them a simple form of redundancy. By replicating your EC2 instance across multiple AZs, you significantly reduce the chance of total outage or failure.
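To see why replication across zones helps, consider a rough model in which each zone is unavailable with the same probability and zones fail independently of one another. Both are simplifying assumptions (real outages can be correlated), and the function below is a hypothetical illustration rather than an AWS calculation:

```python
def total_outage_probability(per_zone_unavailability: float, zone_count: int) -> float:
    """Probability that every zone is down at once, assuming independent failures."""
    if not 0.0 <= per_zone_unavailability <= 1.0:
        raise ValueError("per_zone_unavailability must be between 0 and 1")
    if zone_count < 1:
        raise ValueError("zone_count must be at least 1")
    # Independent events multiply: all zones down = p * p * ... (zone_count times).
    return per_zone_unavailability ** zone_count


if __name__ == "__main__":
    p = 0.0005  # 0.05 percent unavailability, i.e. 99.95 percent uptime
    for zones in (1, 2, 3):
        print(f"{zones} zone(s): {total_outage_probability(p, zones):.3e}")
```

Under these assumptions, going from one zone to two cuts the chance of a total outage from 0.0005 to 0.0005 × 0.0005 = 0.00000025.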

It’s worth noting that bandwidth across zone boundaries costs $0.01/GB, which is a fraction of the cost of Internet traffic at large but is important to consider when calculating cloud costs. It’s also important to remember that information transfer does have an upper limit bounded by the speed of light, meaning that if you’re using two geographically distant AZs to house your EC2 instances you may experience some latency in the event of a failure.
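Both considerations are easy to estimate. The sketch below uses the $0.01/GB price quoted above and the speed of light in a vacuum as a hard lower bound on round-trip time; real links through fiber and network equipment will be slower, and the distances are whatever you choose to plug in:

```python
CROSS_ZONE_PRICE_PER_GB = 0.01         # USD, the figure quoted in the article
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in a vacuum; real fiber is slower


def cross_zone_transfer_cost(gigabytes: float,
                             price_per_gb: float = CROSS_ZONE_PRICE_PER_GB) -> float:
    """Return the USD cost of moving the given volume across a zone boundary."""
    return gigabytes * price_per_gb


def min_round_trip_ms(distance_km: float) -> float:
    """Return the theoretical minimum round-trip time in milliseconds."""
    return 2 * distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000


if __name__ == "__main__":
    print(f"500 GB across zones: ${cross_zone_transfer_cost(500):.2f}")
    print(f"100 km apart: at least {min_round_trip_ms(100):.3f} ms round trip")
```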

Amazon Web Services is available in geographic regions, each containing multiple Availability Zones (AZs), which provide easy access to redundant deployment locations.

Finding Fault Tolerance

As noted by the AWS Reference Architecture for Fault Tolerance and High Availability, while higher-level services such as the Amazon Simple Storage Service (S3), Amazon SimpleDB, Simple Queue Service (SQS) and Elastic Load Balancing (ELB) are inherently fault-tolerant, EC2 instances come with a number of tools that must be properly used to achieve overall fault tolerance.

For example, employing ELB can help migrate workloads off failed EC2 instances and ensure you’re not wasting resources, while creating an Auto Scaling group in addition to an existing ELB load balancer will automatically terminate “unhealthy” instances and launch new ones. Also critical is the use of elastic IP addresses, which are public IP addresses that can be mapped to any EC2 instance in the same region, since they’re associated with your AWS account and not the instance itself.
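The terminate-and-replace behavior can be pictured with a toy model. This is not the AWS API: the class names, zone names and instance IDs are all hypothetical, and the sketch only simulates the idea of routing around unhealthy instances and launching replacements until the desired capacity is restored.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


@dataclass
class ToyAutoScalingGroup:
    zones: list[str]
    desired_capacity: int
    instances: list[Instance] = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def _launch(self) -> Instance:
        # Place the new instance in the zone that currently holds the fewest instances.
        zone = min(self.zones, key=lambda z: sum(i.zone == z for i in self.instances))
        instance = Instance(f"i-{next(self._ids):04d}", zone)
        self.instances.append(instance)
        return instance

    def routable(self) -> list[Instance]:
        """Instances a load balancer would still send traffic to."""
        return [i for i in self.instances if i.healthy]

    def reconcile(self) -> list[str]:
        """Terminate unhealthy instances, then launch replacements up to desired capacity."""
        terminated = [i.instance_id for i in self.instances if not i.healthy]
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired_capacity:
            self._launch()
        return terminated


if __name__ == "__main__":
    group = ToyAutoScalingGroup(zones=["zone-a", "zone-b"], desired_capacity=4)
    group.reconcile()                   # initial launch, spread across both zones
    group.instances[0].healthy = False  # simulate a failed health check
    print("terminated:", group.reconcile())
    print("running:", [(i.instance_id, i.zone) for i in group.instances])
```

Placing each replacement in the least-populated zone is a design choice made for this sketch; it keeps the group balanced across zones so that the loss of one zone never takes out every instance.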

In the event of a sudden EC2 failure, an elastic IP lets you shift network requests and traffic to a replacement instance in under two minutes. It’s also a good idea to make use of snapshots in combination with S3: by taking regular point-in-time snapshots of your EC2 instance, saving them to S3 and replicating them across multiple AZs, it’s possible to reduce the impact of unexpected or emerging faults.
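The reason an elastic IP makes failover quick is that the address belongs to the account rather than to any one instance, so clients keep using the same address while it is pointed at a different machine. A minimal sketch of that idea follows; the classes, instance IDs and region label are hypothetical stand-ins rather than AWS API calls, and the address comes from a range reserved for documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instance:
    instance_id: str
    region: str


class ToyElasticIp:
    """A public address owned by the account and remappable within one region."""

    def __init__(self, address: str, region: str) -> None:
        self.address = address
        self.region = region
        self.attached_to: Optional[Instance] = None

    def associate(self, instance: Instance) -> None:
        # The address can only be mapped to instances in its own region.
        if instance.region != self.region:
            raise ValueError(
                f"{instance.instance_id} is in {instance.region}, not {self.region}"
            )
        self.attached_to = instance


if __name__ == "__main__":
    primary = Instance("i-primary", "us-west-2")
    standby = Instance("i-standby", "us-west-2")
    eip = ToyElasticIp("203.0.113.10", "us-west-2")

    eip.associate(primary)
    # The primary fails: remap the same public address to the standby.
    eip.associate(standby)
    print(eip.address, "now points at", eip.attached_to.instance_id)
```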

Mission-critical workloads now have a place in Amazon’s EC2 offering. Ensuring the high availability these workloads demand, however, means making the best use of both the redundancy and fault-tolerance tools available to any EC2 instance.