With millions of customers and a host of entry-level services that help companies move to the cloud quickly, it’s no surprise that organizations are shifting mission-critical apps and services onto the Amazon Elastic Compute Cloud (EC2). But even Amazon isn’t immune to hardware failures, natural disasters or other system outages.
First up? Defining the difference between redundant and fault-tolerant solutions. While the terms are certainly related — and often used interchangeably — they’re not exactly the same. And although there’s no hard-and-fast rule regarding the definitions, the commonly accepted answer goes like this:
Put simply, redundant means having more than one of something in case the first instance fails. Two disks in the same system that mirror each other are redundant: if one fails, the other can pick up the slack. If the entire system fails, however, both disks are useless. That is where fault tolerance comes in: it keeps the system as a whole operating even if portions of the system fail. So, how does this apply to EC2 and the Amazon cloud?
For many companies, the cloud acts as both a home for applications and a flexible DR service in the event of local systems failure. But what happens when the cloud itself goes down? Like all cloud providers, Amazon has experienced outages due to weather, power failures and other disasters; while the company promises 99.95 percent uptime for its compute instances, this still allows for roughly 4.4 hours of downtime per year (0.05 percent of the 8,760 hours in a year). Using Amazon as a DR solution is now both possible and recommended, but it isn’t perfect. To address this issue, EC2 comes with several tools that can help companies increase both their total redundancy and overall fault tolerance.
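The downtime figure follows directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name and constant are illustrative, not part of any AWS tooling):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime still consistent with an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # The 99.95 percent figure is the one quoted in the article.
    print(f"{allowed_downtime_hours(99.95):.2f} hours per year")
```

The same function can be reused for any other uptime target, for instance to see how much each additional "nine" shrinks the allowance.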
How do companies address the issue of redundancy in their EC2 instances? It starts with availability zones (AZs). Each AWS region is divided into several of these zones, meaning that if you’re on the West Coast of the United States you’ll have a choice of multiple zones in a West Coast region, each independently powered and cooled, with its own network and security architecture. AZs are insulated from failures in the other zones of the same region, making them a simple form of redundancy. By running copies of your EC2 instances in multiple AZs, you significantly reduce the chance of total outage or failure.
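To get a rough feel for why spreading instances across AZs helps, the sketch below estimates combined availability. It rests on a strong assumption that is not from AWS: zone failures are treated as statistically independent and each zone is assigned the same availability. Real outages can be correlated, so the result is an optimistic upper bound rather than a guarantee.

```python
def combined_availability(per_zone_availability: float, zones: int) -> float:
    """Probability that at least one zone is up, assuming independent failures."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    if not 0.0 <= per_zone_availability <= 1.0:
        raise ValueError("per_zone_availability must be between 0 and 1")
    p_all_fail = (1.0 - per_zone_availability) ** zones
    return 1.0 - p_all_fail


if __name__ == "__main__":
    # 0.9995 is used purely as an illustrative per-zone figure.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.9995, n):.10f}")
```

Under these assumptions, each added zone multiplies the chance of a total outage by the per-zone failure probability, which is why even a second AZ makes a large difference on paper.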
It’s worth noting that bandwidth across zone boundaries costs $0.01/GB, which is a fraction of the cost of Internet traffic at large but is important to consider when calculating cloud costs. It’s also important to remember that information transfer has an upper limit bounded by the speed of light: the farther apart the locations that house your EC2 instances, the more latency you may see when traffic is rerouted after a failure.
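Both considerations can be put into numbers. The following is a back-of-the-envelope sketch: the $0.01/GB rate is the one quoted above and may not match current pricing, and the fiber factor (light travelling at about two-thirds of its vacuum speed in optical fiber) is an approximation introduced here for illustration. The latency result is a physical lower bound and ignores routing, queuing and processing delays.

```python
CROSS_AZ_USD_PER_GB = 0.01         # rate quoted in the article; verify current pricing
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: light is ~1/3 slower in optical fiber


def cross_az_transfer_cost(gigabytes: float,
                           usd_per_gb: float = CROSS_AZ_USD_PER_GB) -> float:
    """Estimated cost in USD of moving data across an AZ boundary."""
    return gigabytes * usd_per_gb


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber for a given one-way distance."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_seconds * 1000.0


if __name__ == "__main__":
    print(f"500 GB replicated: ${cross_az_transfer_cost(500):.2f}")
    print(f"1,000 km apart: at least {min_round_trip_ms(1000):.1f} ms round trip")
```

Multiplying the per-transfer cost by the expected replication volume per month gives a quick estimate to add to the overall cloud budget.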
As noted by the AWS Reference Architecture for Fault Tolerance and High Availability, while higher-level services such as the Amazon Simple Storage Service (S3), Amazon SimpleDB, Simple Queue Service (SQS) and Elastic Load Balancing (ELB) are inherently fault-tolerant, EC2 instances come with a number of tools that must be properly used to achieve overall fault tolerance.
For example, employing ELB can help route traffic away from failed EC2 instances and ensure you’re not wasting resources, while creating an Auto Scaling group in addition to an existing ELB load balancer will automatically terminate “unhealthy” instances and launch new ones. Also critical is the use of Elastic IP addresses, which are public IP addresses that can be mapped to any EC2 instance in the same region, since they’re associated with your AWS account and not the instance itself. In the event of a sudden EC2 failure, an Elastic IP lets you shift network requests and traffic to a replacement instance in under two minutes. It’s also a good idea to make use of snapshots in combination with S3: by taking regular point-in-time snapshots of your EC2 instance, saving them to S3 and replicating them across multiple AZs, it’s possible to reduce the impact of unexpected or emerging faults.
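The terminate-and-replace behavior described for Auto Scaling can be pictured as a simple reconciliation loop. The toy model below is not the AWS API and makes no AWS calls; the `Instance` class, the `reconcile` function, the instance IDs and the zone names are all hypothetical and exist only to illustrate the idea of dropping unhealthy instances and launching replacements in the least-populated zone.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


_next_id = count(1)  # hypothetical ID generator for replacement instances


def launch(zone: str) -> Instance:
    """Simulate launching a fresh, healthy instance in the given zone."""
    return Instance(instance_id=f"i-{next(_next_id):04d}", zone=zone)


def reconcile(fleet: list[Instance], desired: int, zones: list[str]) -> list[Instance]:
    """Discard unhealthy instances, then launch replacements up to `desired`.

    Each replacement goes to the zone that currently holds the fewest
    surviving instances, so the fleet stays spread across zones.
    """
    survivors = [inst for inst in fleet if inst.healthy]
    while len(survivors) < desired:
        zone = min(zones, key=lambda z: sum(1 for i in survivors if i.zone == z))
        survivors.append(launch(zone))
    return survivors


if __name__ == "__main__":
    fleet = [
        Instance("i-aaaa", "us-west-2a"),
        Instance("i-bbbb", "us-west-2b", healthy=False),  # simulated failure
    ]
    for inst in reconcile(fleet, desired=2, zones=["us-west-2a", "us-west-2b"]):
        print(inst)
```

In a real deployment the health signal would come from the load balancer or instance status checks rather than a boolean flag, and the managed service performs this loop for you.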
Mission-critical workloads now have a place in Amazon’s EC2 offering. Ensuring the high availability demanded by these workloads, however, means making the best use of both the redundancy and fault-tolerance tools available for EC2 instances.