The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon’s sprawling network, according to a post-mortem from company engineers. The series of failures lasted for 15 hours and 32 minutes, Amazon said.

Network intelligence company Ookla said its Downdetector service received more than 17 million reports of disrupted services offered by 3,500 organizations. The three countries generating the most reports were the US, the UK, and Germany, and Snapchat, AWS, and Roblox were the most reported affected services. Ookla said the event was “among the largest internet outages on record for Downdetector.”

It’s always DNS

Amazon said the root cause of the outage was a software bug in the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. The bug was a race condition, an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers’ control. The result can be unexpected behavior and potentially harmful failures.

In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.” While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans, and a separate DNS Enactor began to implement them. The timing of these two enactors triggered the race condition, which ended up taking out the entire DynamoDB service, Amazon engineers explained.
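To make the failure mode more concrete, here is a minimal sketch in Python of how a delayed writer can clobber a newer configuration when nothing checks which plan is newer. This is an illustration under assumed names (DnsState, make_plan, slow_enactor, fast_enactor), not AWS’s actual code or architecture.

```python
# Minimal sketch of the kind of race described above. Two "enactors" apply
# DNS plans to shared state; the delayed one overwrites a newer plan with a
# stale one because nothing compares plan generations before applying.
import threading
import time


class DnsState:
    """Shared DNS configuration for an endpoint, written by enactors."""

    def __init__(self):
        self.lock = threading.Lock()
        self.active_plan = None  # the plan currently serving lookups

    def apply(self, plan):
        # BUG: applies whatever plan the caller holds, without checking
        # whether a newer generation is already active.
        with self.lock:
            self.active_plan = plan


def make_plan(generation):
    """Stand-in for the DNS Planner: plan N is newer than plan N-1."""
    return {"generation": generation, "records": f"endpoint config v{generation}"}


def slow_enactor(state, plan, delay):
    # Simulates the enactor that hit "unusually high delays" and kept
    # retrying: by the time it finally writes, its plan is stale.
    time.sleep(delay)
    state.apply(plan)


def fast_enactor(state, plan):
    # A second enactor picks up the planner's newer plan and applies it promptly.
    state.apply(plan)


if __name__ == "__main__":
    state = DnsState()
    old_plan = make_plan(1)  # produced first
    new_plan = make_plan(2)  # produced later by the planner

    t1 = threading.Thread(target=slow_enactor, args=(state, old_plan, 0.5))
    t2 = threading.Thread(target=fast_enactor, args=(state, new_plan))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    # The delayed enactor wrote last, so the stale plan silently replaced the
    # newer one; the endpoint is now serving generation 1, not 2.
    print("active plan:", state.active_plan)
```

In the sketch, the stale plan wins simply because it was written last. A common safeguard against this class of bug is to tag each plan with a generation number and have writers refuse to apply anything older than the plan already active, typically with a compare-and-set rather than a blind overwrite.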
