Analysis

Unchecked single-bit errors brought S3 to its knees

posted on 26 July 2008 09:12


S3 brought down by corrupt server-to-server system state messages

Amazon's S3 (Simple Storage Service) was halted on July 20th by single-bit corruption of inter-server state messages, as Amazon tells it here.

The service suffered a six- to eight-hour outage on that Sunday, and operators eventually had to shut the system down and restart it.
The component servers in Amazon's worldwide setup exchange state messages before dealing with customer storage requests. This is done to identify failed servers and bypass them in a sort of fuzzy active-active clustering. However, the state messages ('gossiping', in Amazon's terms) were corrupted by single-bit errors, so servers couldn't process them properly and spent most of their time trying to cope with the screwed-up state messages instead of serving customers.

Unlike customer-server messages and data, which are checked for errors, the server-to-server state messages were not. So the corruption, whose root cause is unknown, spread and spread, forcing the administrators' hands until they shut the system down. An Amazon statement said: 'As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.'

Now the system software does error-check the 'gossip' messages and server gossip time has been reduced.
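
To make the idea concrete, here is a minimal sketch of what error-checking a state message can look like. It is not Amazon's implementation, which has not been published in detail: the JSON message layout, the CRC32 checksum, and the names encode_state, decode_state and flip_bit are all assumptions made for illustration. The point is simply that a receiver which verifies a checksum can drop a message damaged by a single flipped bit rather than act on it or pass it along.

```python
import json
import zlib


def encode_state(state: dict) -> bytes:
    """Serialize a (hypothetical) state message and append a 4-byte CRC32."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def decode_state(message: bytes) -> dict:
    """Return the state if the checksum matches; raise ValueError otherwise."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload = message[:-4]
    received = int.from_bytes(message[-4:], "big")
    if zlib.crc32(payload) != received:
        # A corrupt message is rejected here instead of being gossiped onward.
        raise ValueError("checksum mismatch: dropping corrupt state message")
    return json.loads(payload.decode("utf-8"))


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error at the given bit position."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Illustrative message; the field names are invented for this sketch.
good = encode_state({"server": "node-17", "status": "up"})
bad = flip_bit(good, 10)

print(decode_state(good))
try:
    decode_state(bad)
except ValueError as err:
    print("rejected:", err)
```

A CRC is guaranteed to catch any single-bit error, whether the flipped bit lands in the payload or in the checksum itself, which is why it suffices for the failure described here. A production system might well choose a stronger digest.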

Amazon finishes up its statement with this text: 'Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.'

A zero-downtime S3 is the goal.

[Chris Mellor.]



tags:  cloud