A recent Amazon service collapse shows that a simple user error can be enough to create disaster for businesses that rely on cloud storage applications.
On Feb. 28, Amazon Web Services, the company’s cloud and data center business, went offline for several hours, affecting users of Amazon, Dropbox, Slack, Trello, Imgur, and other services.
After resolving the issue, the company explained what had gone wrong with its Simple Storage Service (“S3”): a typo. According to Amazon:
“The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.”
As a result, the system required a full restart.
To guard against failures like this in the future, Amazon, like other tech giants including Google and Microsoft, has been aggressively expanding the number of data centers it operates around the world to ensure redundancy and avoid damaging outages.
How would your organization handle an outage like this? Emerging technologies like cloud storage are among the issues tackled in DRI’s IT/DR Workshop. You will learn about your responsibilities, the benefits and drawbacks of the tools and processes available to you, and how your organization can increase its preparedness. You’ll come away with the skills you need to create an IT/DR project plan and gain management approval for it.