A Replay of the Amazon Virginia Data Center Outage

A while ago I saw that Amazon's data center in Virginia went down, making many websites hosted on Amazon's cloud services inaccessible, which caused quite a stir and a lot of discussion. At the time I assumed service would be restored within a few hours, but it actually took days. Below are the key points in the timeline, followed by some of my own thoughts:

At 12:47 AM PDT on April 21st, an incorrect traffic shift operation led to the disaster.

At 2:40 AM PDT on April 21st, the team deployed a change that disabled all new Create Volume requests in the affected Availability Zone, and by 2:50 AM PDT, latencies and error rates for all other EBS related APIs recovered.

By 5:30 AM PDT on April 21st, error rates and latencies again increased for EBS API calls across the Region.

At 8:20 AM PDT on April 21st, the team began disabling all communication between the degraded EBS cluster in the affected Availability Zone and the EBS control plane.

At 11:30AM PDT on April 21st, the team developed a way to prevent EBS servers in the degraded EBS cluster from futilely contacting other servers. Latencies and error rates for new EBS-backed EC2 instances declined rapidly and returned to near-normal at Noon PDT. About 13% of the volumes in the affected Availability Zone were in this “stuck” (out of service) state.

At 02:00AM PDT on April 22nd, the team successfully started adding significant amounts of new capacity and working through the replication backlog.

At 12:30PM PDT on April 22nd, all but about 2.2% of the volumes in the affected Availability Zone were restored.

At 11:30 AM PDT on April 23rd we began steadily processing the backlog.

At 6:15 PM PDT on April 23rd, API access to EBS resources was restored in the affected Availability Zone.

At 3:00 PM PDT on April 24th, the team began restoring the remaining stuck volumes. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state.


To Amazon, 0.07% is just a number, but for some sites it means death. I wonder how Amazon will compensate those websites.

Reading through the whole recovery process felt like reading a thriller. I don't know how Amazon's support engineers got through those days; they must have been exhausted, body and mind.

Meanwhile, incidents like this are probably playing out at different companies all the time, but when one happens to Amazon it carries far more weight, because Amazon has become a platform. In other words, the whole world is watching you.

Looking back at the whole incident, can it simply be blamed on an operational mistake? It does not seem that simple. The mis-operation exposed design weaknesses and latent bugs in EBS and its control plane, which may not be a bad thing for Amazon's long-term reliability; hitting these problems later, while serving more and larger applications, would have been far more costly. You cannot take comfort in thinking that without the operational error the incident would never have happened. As the saying goes, all roads lead to Rome: as long as the problem exists, then as the data scale grows, some condition will eventually lead to that bug, and the disaster will happen anyway.

Digging into the root cause, my feeling is that Amazon had not paid enough attention to network partitions, or at least had not tested for them sufficiently. When I saw Google emphasize network partitions earlier, I did not take it very seriously either; now it is clear that these problems are rare, but deadly when they do occur. The other issue is request-handling priority. I suspect many logic/proxy server programs run into the same problem: when some backend servers block, the request queue fills up and the server can no longer serve the normal requests. The industry already has plenty of solutions for this, but my point is that the solution is not the most important thing; what matters is whether you realize that "this" is a problem in the first place. A sketch of one such solution follows below.
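To make that second point concrete, here is a minimal Python sketch of one common mitigation, a bulkhead-style per-backend concurrency cap: a stuck backend fails fast instead of tying up the shared worker pool and starving requests to healthy backends. This is my own illustration of the general technique, not Amazon's implementation; all names, limits, and timeouts are assumptions for the example.

import threading
import time
from concurrent.futures import ThreadPoolExecutor

PER_BACKEND_LIMIT = 4    # assumed cap on in-flight requests per backend
ACQUIRE_TIMEOUT_S = 0.1  # how long to wait for a slot before rejecting

class Bulkhead:
    """Per-backend concurrency cap: reject quickly once a backend is saturated."""
    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)

    def call(self, fn, *args):
        if not self._slots.acquire(timeout=ACQUIRE_TIMEOUT_S):
            raise RuntimeError("backend saturated")
        try:
            return fn(*args)
        finally:
            self._slots.release()

def stuck_backend(req):
    time.sleep(5)            # simulates a backend that stops answering
    return f"stuck:{req}"

def healthy_backend(req):
    time.sleep(0.01)
    return f"ok:{req}"

backends = {"stuck": stuck_backend, "healthy": healthy_backend}
bulkheads = {name: Bulkhead(PER_BACKEND_LIMIT) for name in backends}

def handle(name, req):
    try:
        return bulkheads[name].call(backends[name], req)
    except RuntimeError:
        # Fail fast instead of queueing behind a stuck backend.
        return f"rejected:{req}"

if __name__ == "__main__":
    # The shared worker pool is the scarce resource a stuck backend would exhaust.
    pool = ThreadPoolExecutor(max_workers=8)
    stuck = [pool.submit(handle, "stuck", i) for i in range(20)]
    healthy = [pool.submit(handle, "healthy", i) for i in range(20)]
    # Only PER_BACKEND_LIMIT workers get tied up by the stuck backend; its other
    # requests are rejected quickly, so the healthy traffic is still served.
    print([f.result() for f in healthy])
    pool.shutdown(wait=True)

The idea is simply to turn saturation into a fast, explicit per-backend error instead of an invisible backlog in a shared queue; timeouts, circuit breakers, and priority queues are variations on the same theme.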

http://aws.amazon.com/message/65648/

Original post: https://www.cnblogs.com/raymondshiquan/p/2033873.html