Today’s sticky topic is that of SLAs.

As stated in the AWS EC2 and EBS SLA, “AWS will use commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage (defined below) of at least 99.95%”. If the SLA is not met, a percentage service credit is given, not a refund. An outage is defined as follows:

  • EC2 outage – all of your instances have no external connectivity, and this is occurring in more than one Availability Zone in a particular Region. There is no per-instance SLA target.

  • EBS outage – all of your attached volumes perform zero read/write I/O, with pending I/O in the queue.

Image: Service Elevator (Sam Howzit via Flickr)

The SLA was only recently updated to include EBS. A failure in EBS precipitated some of the more infamous AWS failures. It’s not surprising, as many AWS services depend on EBS (Elastic Load Balancer, Relational Database Service, Elastic Beanstalk and others), so when EBS fails they fail.

AWS makes the following SLA commitments:

[table th=”1″]

AWS service, SLA, Notes

EC2/EBS,99.95%,

S3,99.9%,measured by the error rate percentage

RDS,99.95%,only applies to Multi-AZ RDS instances

Route 53,100%,

[/table]

With regard to the S3 SLA, a service credit is given when uptime drops below 99.9%, but even when uptime drops to 99%, only a 25% credit is given. This is all rather interesting because S3 is actually designed for 99.99% availability.
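To put the credit in perspective, here is a rough sketch of how a tiered credit might be worked out against a monthly bill. The tier boundaries are an assumption: the 25% figure comes from the paragraph above, while the 10% tier for uptime between 99% and 99.9% is my recollection of the S3 SLA and should be checked against the current document. The names `S3_CREDIT_TIERS` and `service_credit` are made up for illustration.

```python
# Hypothetical credit tiers as (uptime threshold in percent, credit fraction),
# sorted from the worst uptime band upwards. Check the real SLA before relying on these.
S3_CREDIT_TIERS = [
    (99.0, 0.25),  # uptime below 99%   -> 25% credit
    (99.9, 0.10),  # uptime below 99.9% -> 10% credit (assumed tier)
]


def service_credit(monthly_uptime_percent: float, monthly_bill: float) -> float:
    """Return the service credit for a month, given uptime and the bill."""
    for threshold, credit in S3_CREDIT_TIERS:
        if monthly_uptime_percent < threshold:
            return monthly_bill * credit
    return 0.0  # SLA met, no credit


if __name__ == "__main__":
    for uptime in (99.95, 99.5, 98.5):
        print(f"{uptime}% uptime on a $1000 bill -> ${service_credit(uptime, 1000.0):.2f} credit")
```

Note that 1% of a 30-day month is a little over seven hours of errors, so the credit is unlikely to cover what an outage of that size costs your business.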

All of the AWS SLAs apply within a single region, so if you require a better SLA you need to spread your application across multiple Amazon regions, or other providers. This reduces the likelihood of an outage, based on common sense and history, but Amazon makes no actual uptime commitment in this case.
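As a back-of-envelope illustration of why multiple regions help, the sketch below estimates combined availability under two big assumptions: that regions fail independently of each other, and that failover between them is instant and never fails. Neither holds in practice, and this is not a figure Amazon commits to. The function name `combined_availability` is hypothetical.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability when any one of N independent regions can serve traffic.

    per_region is a fraction (0.9995 for 99.95%). The service is down only
    when every region is down at the same time.
    """
    if regions < 1:
        raise ValueError("need at least one region")
    return 1 - (1 - per_region) ** regions


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} region(s): {combined_availability(0.9995, n) * 100:.7f}%")
```

On paper, two regions at 99.95% each get you past five nines. In reality correlated failures, shared control planes and your own failover logic will eat into that number.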

If you decide you need two or more regions to meet your availability target and you live pretty much anywhere apart from the USA you’re sort of stuck – especially if you have data sovereignty and latency requirements – because most countries only have the one region.

The gold standard in SLAs is the famous five-nines (99.999%), which amounts to about 5 minutes of outage each year. This has been around, and achievable, since the 80s. Why can AWS only commit to 99.95% (about 4 hours a year of outage)? Well, it’s mainly due to complexity. Getting five-nines reliability out of a single server, or a single program, isn’t as hard because there’s only one part to control. Cloud computing has many moving parts. Faced with this reality, many people are asking the question, “Do I really need five-nines?” Good question. A lot of the time, probably not.
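The outage figures above are simple arithmetic, and it can be handy to run them for your own target. Here is a minimal sketch, assuming a 365-day year; the function name `allowed_downtime_minutes` is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability percentage."""
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% -> {minutes:.1f} minutes per year ({minutes / 60:.2f} hours)")
```

Since the AWS SLAs are measured monthly, you can pass a month as the period instead, for example `allowed_downtime_minutes(99.95, 30 * 24 * 60)` for a 30-day month.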

So accept the SLA limitations of AWS, if you can, and move on, because if you’re not happy with them, it’s not like you can say anything publicly. As it states in the Customer Agreement, “You will not issue any press release or make any other public communication with respect to this Agreement or your use of the Service Offerings”.

The saving grace is that Amazon uses AWS itself and in the past has been more generous to its customers than required by its own SLAs when there were significant outages.

[stextbox id=”info”]In many presentation slides Amazon mentions the durability of S3 being 99.999999999%. Eleven nines! As they state in their FAQ, “if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years.” A big number. Very impressive. Pretty pointless. S3 may go down but they’re pretty sure your data is safe. Durability is about data loss and not about service availability.

The important part is that S3 is designed to survive the loss of data in two data centres. You can purchase reduced redundancy storage (S3 RRS) to save on cost, but this is designed to only survive the loss of one data centre, which is still pretty good.

In any case, I think you, or your application, are probably more likely to stuff up your data. Actually, with eleven nines it’s probably more likely that your business, AWS or the world economy will fail, such is the pointlessness of planning to eleven nines![/stextbox]
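For the curious, the FAQ’s “once every 10,000,000 years” follows directly from the eleven nines. Here is a small sketch of the arithmetic, assuming the durability figure is an annual, per-object probability and that object losses are independent; `years_per_lost_object` is a made-up name.

```python
# Eleven nines of durability means roughly a 1e-11 chance of losing a given
# object in a given year (assumption: the figure is annual and per object).
ANNUAL_LOSS_PROBABILITY = 1e-11


def years_per_lost_object(objects: int,
                          annual_loss_probability: float = ANNUAL_LOSS_PROBABILITY) -> float:
    """Average number of years between single-object losses."""
    expected_losses_per_year = objects * annual_loss_probability
    return 1 / expected_losses_per_year


if __name__ == "__main__":
    print(f"{years_per_lost_object(10_000):,.0f} years per lost object")
```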