
Thursday, May 12, 2011

A Net worker's nightmare come true: lessons from the April 21 outage of Amazon.com

The incident summary of the April 21 outage of Amazon.com has been out for a while. It reminds me of some typical nightmares of a Net worker like me.

The "prelude" to that incident was an error in executing a network change, as the summary describes:
..... The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. .....

I have made mistakes myself while making configuration changes. I feel lucky that most of them were found quickly and were easy to recover from. Although I am skilled enough to be called an "expert", I can never guarantee that I will not make another mistake!

I think Amazon.com has learned a lot from this incident. I like this statement:
We will audit our change process and increase the automation to prevent this mistake from happening in the future.
Automation is the key to minimizing the possibility of human error, although it is not easy!
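
To give a rough idea of what such automation could look like, here is a minimal Python sketch of a pre-change check that refuses a traffic shift when the target is on a different network or is too small for the current load. This is only my illustration under stated assumptions: the `Link` class, the `validate_shift` function, and all names and capacities are hypothetical and do not come from Amazon's summary or from any real tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Link:
    """A hypothetical traffic target; names and numbers are illustrative only."""
    name: str
    network: str          # e.g. "primary" or "redundant"
    capacity_gbps: float


def validate_shift(current_load_gbps: float, source: Link, target: Link) -> List[str]:
    """Return the reasons to block a traffic shift; an empty list means no objection."""
    problems = []
    # Guard 1: traffic should stay on the same network as the router being drained.
    if target.network != source.network:
        problems.append(
            f"{target.name} is on the {target.network} network, "
            f"but {source.name} is on the {source.network} network"
        )
    # Guard 2: the target must be able to carry the load being moved.
    if target.capacity_gbps < current_load_gbps:
        problems.append(
            f"{target.name} has {target.capacity_gbps} Gbps capacity, "
            f"but the current load is {current_load_gbps} Gbps"
        )
    return problems


# Made-up example topology (assumed values, not real figures).
router_a = Link("primary-router-a", "primary", 10.0)
router_b = Link("primary-router-b", "primary", 10.0)
backup = Link("redundant-router", "redundant", 1.0)

if __name__ == "__main__":
    for target in (router_b, backup):
        issues = validate_shift(8.0, router_a, target)
        verdict = "OK" if not issues else "BLOCKED: " + "; ".join(issues)
        print(f"shift {router_a.name} -> {target.name}: {verdict}")
```

A check like this does not remove the human from the change, but it turns a silent wrong turn into an explicit refusal before any traffic moves.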