Chaos Monkey saves Netflix
11 Nov 2010
A lot of dedicated server people have been jumping on the anti-cloud bandwagon lately after Amazon’s AWS cloud went down for a few days. Many big-name sites like Foursquare were offline for quite some time, but amazingly one company that uses Amazon almost exclusively seemed to stand on its own without any issues: Netflix stayed online. How did they do it? The secret lies in the way they structured their network. Great programmers solve problems; superb developers avoid them altogether.
According to company rep John Ciancutti, Netflix had built an application named Chaos Monkey. Without going too in depth, what this program basically does is randomly take down Netflix instances in an attempt to test the network and ensure that there is no single point of failure in its streaming service. Most companies would consider this suicide, but Netflix embraced virtualized architecture early on and understood the drawbacks of depending on a single network or provider to do everything for them. They made sure that their site and systems were robust enough to jump from cloud instance to instance based on availability and traffic. This is why 20% of evening internet traffic can be attributed to Netflix streaming and customers never have to see a delay or drop in service.
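To make the idea concrete, here is a minimal sketch in Python of the kind of test described above. It is not Netflix’s code: the `Instance`, `Service`, `chaos_monkey`, and `survives_chaos` names are all made up for illustration, and the "instances" are plain in-memory objects rather than real cloud servers. It only shows the principle of killing instances at random and checking that requests still get served.

```python
import random


class Instance:
    """Hypothetical stand-in for one cloud server instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first instance that is still alive."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # this one is down, try the next
        raise RuntimeError("all instances are down")


def chaos_monkey(service, rng):
    """Take down one randomly chosen live instance; return it, or None."""
    alive = [i for i in service.instances if i.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim


def survives_chaos(service, kills, rng):
    """Kill `kills` instances one at a time, probing the service after each."""
    for _ in range(kills):
        chaos_monkey(service, rng)
        try:
            service.handle("GET /stream")
        except RuntimeError:
            return False  # a kill took the whole service offline
    return True
```

With three instances, `survives_chaos(service, 2, random.Random())` comes back true because one instance is always left to answer. A service built on a single instance fails the same check on the first kill, which is exactly the single point of failure the test is meant to expose.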
Now you may ask, how does a single developer do something like this? Well, it’s simple. Say, for example, you run a Python or PHP site using Memcache for caching and MySQL as your database backend. Much like a RAID system spreads its data across multiple hard drives, you can run multiple MySQL servers across multiple instances and mirror much of your code as well. This is important because as people come to your site and traffic begins to take off, you will need to move them onto different servers with the same information, and if any one server goes down, you have a full backup of everything and can be back online with a simple DNS pointer change. If you’re using Amazon, another great idea is to split things up so your data lives in S3 and your server content in EC2. Some CMSs like WordPress and Drupal actually have plugin modules that let you automatically save your media to S3 so your site can elastically keep up with processing demands under high traffic. If you’re building your application yourself, Amazon has a whole class on using their S3 APIs to send and pull data to and from their servers.
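The mirroring idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `Replica` and `MirroredStore` are hypothetical names, and each "database" is a plain dictionary standing in for a MySQL server. Real MySQL replication is configured on the servers themselves and also handles a replica catching up after an outage, which this sketch does not.

```python
class Replica:
    """Hypothetical stand-in for one database server holding a full copy."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.rows = {}

    def write(self, key, value):
        if not self.online:
            raise ConnectionError(self.name)
        self.rows[key] = value

    def read(self, key):
        if not self.online:
            raise ConnectionError(self.name)
        return self.rows[key]


class MirroredStore:
    """Writes to every online replica, reads from the first one that answers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, key, value):
        written = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                written += 1
            except ConnectionError:
                pass  # skip replicas that are down (no catch-up in this sketch)
        if written == 0:
            raise RuntimeError("no replica available")
        return written  # number of copies that were stored

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("no replica available")
```

If you write a row through a store with two replicas and then mark the first one offline, the read still succeeds from the second copy. Only when every replica is down does the store give up, which is the point of keeping more than one.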
This should be a lesson to developers out there looking at the cloud: instead of running back into your holes and buying your own dedicated machines, embrace the future and face your fear of downtime head on. There’s nothing wrong with wanting your own dedicated service, but industries rise and fall based on innovation, and what works for one company may not work for future developers. If you are a startup looking at your options, choose cloud, choose the future, in whatever instance or service that may be. But know the downsides and prepare for them. I think Chaos Monkey (a monkey wreaking havoc on your network on purpose) is a fantastic idea. Much like modern software developers use fuzzing techniques to find vulnerabilities in their applications, website and network developers should use similar methods to constantly test their site or service and ensure there is no single point of failure on their networks. Only then can we really, truly live in the cloud without wires.