07 May 2011

Rambo Architecture

Amazon’s EC2 & EBS outage | Cloud Zone
I just read a post on Coding Horror that refers to a year-old post on Netflix’s blog called “5 lessons we’ve learned using AWS [Amazon Web Services]”. Netflix, in case you’re wondering, survived Amazon’s outage. Indeed, in lesson #3 they explain that if you want to survive failures, you have to plan for them and constantly test for them:

3. The best way to avoid failure is to fail constantly. We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends. If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.

That's a great quote. You have to build systems that expect their dependencies to fail, and you have to test those failure scenarios.
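
To make the idea concrete, here is a minimal sketch of the "degrade, but still respond" behavior from the quote. It is not Netflix's code. The names `with_fallback`, `personalized_picks`, `popular_titles` and the title list are all hypothetical, and the recommendations call is hard-wired to fail so the fallback path is exercised.

```python
from typing import Callable, List

# Hypothetical static list, used when personalization is unavailable.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def with_fallback(
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
) -> Callable[[str], List[str]]:
    """Wrap a dependency call so that a failure degrades instead of propagating."""

    def call(user_id: str) -> List[str]:
        try:
            return primary(user_id)
        except Exception:
            # The dependency failed: answer with the lower-quality result.
            return fallback(user_id)

    return call


def personalized_picks(user_id: str) -> List[str]:
    # Stand-in for a remote recommendations system that is currently down.
    raise ConnectionError("recommendations system is down")


def popular_titles(user_id: str) -> List[str]:
    # Needs no remote call, so it keeps working when the dependency is down.
    return list(POPULAR_TITLES)


recommend = with_fallback(personalized_picks, popular_titles)
titles = recommend("user-42")  # falls back to the popular titles
```

The same wrapper doubles as a test harness: pass in a `primary` that always raises and check that the caller still gets an answer. That is the "test those scenarios" half of the lesson, in miniature.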