On Tuesday, a large number of Amazon S3 servers went offline. AWS rarely fails, but when it does, it takes down a lot of websites and services with it. This is exactly what happened in the Northern Virginia (US-EAST-1) region. Amazon has since reported that the failure was the result of human error.
In a blog post explaining the outage, the company said that an employee was debugging an issue with the S3 billing system when servers suddenly started going offline. The number kept rising before the employee realized there was a problem. This set off a domino effect that took down servers one after another.
The post read:
“Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”
The affected systems form a huge part of the infrastructure that many websites depend on, so Amazon made a point of explaining publicly and clearly what went wrong.
The outage affected major websites, including Quora, Business Insider, Slack, and the U.S. Securities and Exchange Commission.
Amazon went on to describe the cause:
“Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.”
The company also said that it is making changes to ensure that a human error does not cause such a widespread issue in the future. According to the post, the tool used to remove server capacity has been modified to remove it more slowly and to block any removal that would take a subsystem below its minimum required capacity. A mistyped command should then have a far smaller impact.
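To make the idea of such a safeguard concrete, here is a minimal Python sketch of a capacity-removal check. It is not Amazon's tooling: the names `Subsystem`, `plan_removal`, and `CapacityRemovalError`, the `min_required` field, and the default batch size of 5 are all hypothetical and chosen only for illustration. The sketch rejects a request that would leave a subsystem short of servers and splits an accepted request into small batches.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate a safety limit."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving requests


def plan_removal(subsystem: Subsystem, requested: int, max_batch: int = 5) -> list[int]:
    """Return the batch sizes for removing `requested` servers.

    Raises CapacityRemovalError instead of planning a removal that would
    drop the subsystem below its minimum required capacity.
    """
    if requested <= 0 or max_batch <= 0:
        raise CapacityRemovalError("requested and max_batch must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Split the request into small batches so capacity is removed gradually.
    batches = []
    left = requested
    while left > 0:
        step = min(max_batch, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=80)
    print(plan_removal(index, 12))  # [5, 5, 2]
    try:
        plan_removal(index, 50)  # a mistyped, much larger request
    except CapacityRemovalError as err:
        print("rejected:", err)
```

In this sketch a mistyped input such as 50 instead of 12 is refused outright, while a valid request is carried out in steps, which would give an operator time to notice a problem before much capacity is gone.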
This is not the first time such a disruption has occurred. Back in 2011, Amazon faced a similar disruption in its US East region, which resulted in a four-day blackout. Even so, AWS is reliable in general and goes through disruptions only from time to time, as can happen to any cloud service provider.