A Replay of the Amazon Virginia Data Center Outage

A while ago I saw that Amazon's data center in Virginia went down, making many websites hosted on Amazon's cloud services inaccessible, which caused quite a stir and a lot of discussion. At the time I assumed service would be restored within a few hours, but it actually took days. Below are the key points in the timeline, followed by some of my own thoughts:

At 12:47 AM PDT on April 21st, an incorrect traffic shift operation led to the disaster.

At 2:40 AM PDT on April 21st, the team deployed a change that disabled all new Create Volume requests in the affected Availability Zone, and by 2:50 AM PDT, latencies and error rates for all other EBS related APIs recovered.

By 5:30 AM PDT on April 21st, error rates and latencies again increased for EBS API calls across the Region.

At 8:20 AM PDT on April 21st, the team began disabling all communication between the degraded EBS cluster in the affected Availability Zone and the EBS control plane.

At 11:30AM PDT on April 21st, the team developed a way to prevent EBS servers in the degraded EBS cluster from futilely contacting other servers. Latencies and error rates for new EBS-backed EC2 instances declined rapidly and returned to near-normal at Noon PDT. About 13% of the volumes in the affected Availability Zone were in this “stuck” (out of service) state.

At 02:00AM PDT on April 22nd, the team successfully started adding significant amounts of new capacity and working through the replication backlog.

At 12:30PM PDT on April 22nd, all but about 2.2% of the volumes in the affected Availability Zone were restored.

At 11:30 AM PDT on April 23rd we began steadily processing the backlog.

At 6:15 PM PDT on April 23rd, API access to EBS resources was restored in the affected Availability Zone.

At 3:00 PM PDT on April 24th, the team began restoring the remaining stuck volumes. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state.


To Amazon, 0.07% is just a number, but for some sites it means death. I wonder how Amazon will compensate those websites.

Reading through the whole recovery process felt like reading a thriller. I don't know how Amazon's support engineers got through those days; they must have been exhausted, body and mind.

Meanwhile, incidents like this are probably playing out at different companies all the time, but when one happens to Amazon it carries far more weight, because Amazon has become a platform. In other words, the whole world is watching you.

Looking back at the whole incident, can it simply be blamed on an operational mistake? It does not seem that simple. The mis-operation exposed design weaknesses and latent bugs in EBS and its control plane, which may not be a bad thing for Amazon's long-term reliability; hitting these problems later, while serving more and larger applications, would have been far more costly. You cannot take comfort in thinking that without the operational error the incident would never have happened. As the saying goes, all roads lead to Rome: as long as the problem exists, then as the data scale grows, some condition will eventually lead to that bug, and the disaster will happen anyway.

Digging into the root cause, my feeling is that Amazon had not paid enough attention to network partitions, or at least had not tested for them sufficiently. When I saw Google emphasize network partitions earlier, I did not take it very seriously either; now it is clear that these problems are rare, but deadly when they do occur. The other issue is request-handling priority. I suspect many logic/proxy server programs run into the same problem: when some backend servers block, the request queue fills up and the server can no longer serve the normal requests. The industry already has plenty of solutions for this, but my point is that the solution is not the most important thing; what matters is whether you realize that "this" is a problem in the first place. A sketch of one such solution follows below.
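To make that second point concrete, here is a minimal Python sketch of one common mitigation, a bulkhead-style per-backend concurrency cap: a stuck backend fails fast instead of tying up the shared worker pool and starving requests to healthy backends. This is my own illustration of the general technique, not Amazon's implementation; all names, limits, and timeouts are assumptions for the example.

import threading
import time
from concurrent.futures import ThreadPoolExecutor

PER_BACKEND_LIMIT = 4    # assumed cap on in-flight requests per backend
ACQUIRE_TIMEOUT_S = 0.1  # how long to wait for a slot before rejecting

class Bulkhead:
    """Per-backend concurrency cap: reject quickly once a backend is saturated."""
    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)

    def call(self, fn, *args):
        if not self._slots.acquire(timeout=ACQUIRE_TIMEOUT_S):
            raise RuntimeError("backend saturated")
        try:
            return fn(*args)
        finally:
            self._slots.release()

def stuck_backend(req):
    time.sleep(5)            # simulates a backend that stops answering
    return f"stuck:{req}"

def healthy_backend(req):
    time.sleep(0.01)
    return f"ok:{req}"

backends = {"stuck": stuck_backend, "healthy": healthy_backend}
bulkheads = {name: Bulkhead(PER_BACKEND_LIMIT) for name in backends}

def handle(name, req):
    try:
        return bulkheads[name].call(backends[name], req)
    except RuntimeError:
        # Fail fast instead of queueing behind a stuck backend.
        return f"rejected:{req}"

if __name__ == "__main__":
    # The shared worker pool is the scarce resource a stuck backend would exhaust.
    pool = ThreadPoolExecutor(max_workers=8)
    stuck = [pool.submit(handle, "stuck", i) for i in range(20)]
    healthy = [pool.submit(handle, "healthy", i) for i in range(20)]
    # Only PER_BACKEND_LIMIT workers get tied up by the stuck backend; its other
    # requests are rejected quickly, so the healthy traffic is still served.
    print([f.result() for f in healthy])
    pool.shutdown(wait=True)

The idea is simply to turn saturation into a fast, explicit per-backend error instead of an invisible backlog in a shared queue; timeouts, circuit breakers, and priority queues are variations on the same theme.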

http://aws.amazon.com/message/65648/

Original post: https://www.cnblogs.com/raymondshiquan/p/2033873.html