The "prelude" to that incident was a network change that was executed incorrectly, as described here:
..... The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. .....
I have made mistakes of my own while making configuration changes. I feel lucky that most of them were quickly found and easily recovered from. Although I am skilled enough to be called an "expert", I can never guarantee that I will not make mistakes again!
I think Amazon.com has learned a lot from this incident. I like this statement:
We will audit our change process and increase the automation to prevent this mistake from happening in the future.
Automation is the key to minimizing the possibility of human error, although it is not easy!