A massive AWS outage Monday that brought down some of the world’s most popular apps and services all started with a glitch.
The bug – which occurred when two automated systems were trying to update the same data simultaneously – snowballed into something significantly more serious that Amazon’s engineers scrambled to fix, the company said Thursday in a postmortem assessment.
The outage at the cloud service meant people couldn’t order food, communicate with hospital networks, access mobile banking, or connect with their security systems and smart home devices. Major global companies, including Netflix, Starbucks and United Airlines, were temporarily unable to give customers access to their online services.
“We apologize for the impact this event caused our customers,” Amazon said in a statement on the AWS website. “We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
At a high level, the issue stemmed from two programs competing to write the same DNS entry – essentially a record in the internet’s phonebook – at the same time, which resulted in an empty entry. That threw multiple AWS services into disarray.
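For readers who want to see how two competing writers can leave a record empty, here is a minimal sketch in Python. It is an illustration only, not Amazon's code: the function names, the version numbers, the hostname and the addresses are all invented, and the specific sequence of steps is one assumed interleaving that produces the kind of empty entry described above.

```python
# Hypothetical illustration of a write race on a DNS-like record.
# Nothing here reflects AWS internals; all names and values are invented.

def apply_plan(table, name, plan_version, addresses):
    """Overwrite the record with this plan's addresses, with no version check."""
    table[name] = {"version": plan_version, "addresses": list(addresses)}


def apply_plan_if_newer(table, name, plan_version, addresses):
    """Guarded write: refuse to replace a record that came from a newer plan."""
    current = table.get(name)
    if current is None or plan_version > current["version"]:
        apply_plan(table, name, plan_version, addresses)


def clean_up_older_than(table, name, newest_version):
    """Empty the record if it belongs to a plan older than newest_version."""
    record = table.get(name)
    if record is not None and record["version"] < newest_version:
        record["addresses"] = []


def simulate(write):
    """Run the same interleaving of two writers using the given write function."""
    table = {}
    name = "service.example.com"
    # Writer A applies the newer plan (version 2).
    write(table, name, 2, ["192.0.2.20"])
    # Writer B, running late, tries to apply an older plan (version 1).
    write(table, name, 1, ["192.0.2.10"])
    # Writer A then cleans up anything older than the plan it just applied.
    clean_up_older_than(table, name, 2)
    return table[name]["addresses"]


if __name__ == "__main__":
    print("unguarded writes:", simulate(apply_plan))
    print("guarded writes:  ", simulate(apply_plan_if_newer))
```

With unguarded writes, the late writer overwrites the newer record and the cleanup step then empties it. With the version check, the stale write is rejected and the record survives. The check is shown only as one common way to avoid this class of bug, not as the fix Amazon applied.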