Release The Chaos Monkeys: Intentionally Creating Failures

[Image: young baboons playing. Caption: Are you ready to let them run wild among your machines? Credit: Tambako The Jaguar, license: CC BY-ND 2.0]

My grandfather gave me one piece of romantic advice: before getting serious, take a prospective mate camping. Not RV’ing with all its creature comforts, but rather roughing it on a backwoods adventure. He said that hopefully it would rain, your sleeping bag would get soaked, you’d arrive at the campsite after sunset and be forced to set up in the dark, the fire would be impossible to start, and large ravenous bears would eat your food. The upside of all this misery: you’d get to see your partner at their worst and get a glimpse of what they were really made of. Ah, those hardy Depression-era values. While courtship doesn’t have to include a trip down the river, Deliverance-style, I always thought his scheme was a good idea. It’s best to know how a person will react before you really need them.

Once again, troubleshooting mirrors life: important knowledge about a machine is gained by seeing it at its worst, and that knowledge is best obtained before it matters. Like my grandfather’s camping regimen, there are ways to learn this critical information about a system in advance. That’s right, you don’t have to wait for a breakdown to occur to understand what something will do under duress. We’ll take a lesson from the world of product testing, where designers concoct shenanigans to see how a new model will react to a variety of conditions, trying to get it to fail before it gets into a customer’s hands. While this is something that you might normally associate with manufacturing, I want you to understand that it’s also a tool for the troubleshooter.

Monkey On Your Back

When it comes to creating mischief to foster understanding, the entertainment company Netflix has one of the most advanced systems ever conceived. When Netflix moved their services into the “cloud,” they encountered some unique problems. Amazon Web Services (AWS, a cloud computing environment) offered savings to the bottom line, but presented challenges that forced them to rethink the design of some of their systems. The details aren’t important, but suffice it to say that sharing resources with other tenants meant more variable performance and lower reliability than the Netflix engineers were used to in their own data centers.

Netflix took this new way of life and embraced it:

The best way to avoid failure is to fail constantly.
We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
“5 Lessons We’ve Learned Using AWS.”
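
The “degrade, but still respond” idea in that quote can be sketched in a few lines of Python. This is only an illustration, not Netflix’s code: the names `recommend`, `POPULAR_TITLES`, `healthy_recommender`, and `broken_recommender` are all made up for the example, and the fallback list is assumed to be available locally.

```python
from typing import Callable, List

# Hypothetical static fallback; a real service would likely compute this elsewhere.
POPULAR_TITLES = ["Title A", "Title B", "Title C"]


def recommend(user_id: str, personalized: Callable[[str], List[str]]) -> List[str]:
    """Return personalized picks, or popular titles if the dependency fails."""
    try:
        return personalized(user_id)
    except Exception:
        # Degrade the quality of the response, but still respond.
        return list(POPULAR_TITLES)


def healthy_recommender(user_id: str) -> List[str]:
    """Stand-in for a working recommendations system."""
    return [f"Pick for {user_id}"]


def broken_recommender(user_id: str) -> List[str]:
    """Stand-in for a recommendations system that is down."""
    raise ConnectionError("recommendations system is down")


print(recommend("u1", healthy_recommender))
print(recommend("u1", broken_recommender))
```

The caller never sees the failure; it just gets a less tailored answer. Catching every exception is a deliberate simplification here, and a real system would probably be pickier about which failures it swallows.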

But, we haven’t gotten to the best part. Out of this desire to be truly resilient, the Chaos Monkey was born:

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
“5 Lessons We’ve Learned Using AWS.”

That’s right, the Chaos Monkey is a program that is constantly scurrying about, shutting things off. Not within the safe padded walls of a test environment, but in Netflix’s live, customer-facing service. I want you to understand just how cool and unusual this is: it’s like hiring some crazy drunk guy to run around your workplace, his shirt off, screaming obscenities, flipping switches, slamming doors, and knocking things off desks and shelves. And no one cares because your processes are so robust.
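
To make the idea concrete, here is a toy simulation in Python. It is a sketch under heavy assumptions, not the real Chaos Monkey: the `Fleet` class, the instance names, and the `chaos_monkey` function are invented for illustration, and “shutting off” an instance is just flipping a flag in memory.

```python
import random
from typing import Dict, List, Optional


class Fleet:
    """A toy stand-in for a group of redundant service instances."""

    def __init__(self, names: List[str]) -> None:
        self.running: Dict[str, bool] = {name: True for name in names}

    def healthy(self) -> List[str]:
        return [name for name, up in self.running.items() if up]

    def handle_request(self) -> str:
        live = self.healthy()
        if not live:
            raise RuntimeError("no healthy instances left")
        return f"served by {live[0]}"


def chaos_monkey(fleet: Fleet, rng: random.Random) -> Optional[str]:
    """Shut off one randomly chosen running instance and return its name."""
    live = fleet.healthy()
    if not live:
        return None
    victim = rng.choice(live)
    fleet.running[victim] = False
    return victim


fleet = Fleet(["web-1", "web-2", "web-3"])
rng = random.Random(42)  # seeded so a run can be repeated
killed = chaos_monkey(fleet, rng)
print(f"killed {killed}; {fleet.handle_request()}")
```

The point of the exercise is the last line: after the monkey strikes, a request should still be served by one of the survivors. If it isn’t, you’ve learned something important at a time of your own choosing.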

In addition to the challenges of moving their service to a shared environment, Netflix has another problem: at peak times, their service has been responsible for 1/3 of all Internet traffic in North America. One third! Netflix is operating at a scale that is hard to test. How does one simulate that much Internet traffic? Exactly, you don’t. Netflix’s test environment will always be a toy compared to the real thing, so the Chaos Monkey is even more vital. Without something like it, there wouldn’t be any way to definitively know the resiliency of their infrastructure.

It’s Yours, So You Can Wreck It

Netflix has some special challenges, and the cookie-cutter building blocks of a cloud computing environment make the Chaos Monkey a feasible option for them. You’re probably thinking, there’s no way I’d introduce such a thing into my work environment! And of course, you’d be right: unless you’re prepared for it, unleashing a Chaos Monkey in your workplace would put you on the fast track to going out of business. However, there are lessons and strategies to be extracted from the concept.

First off, just get used to the notion that stress testing is an option for your machines, tools, etc. As noted before, this is a way of life for manufacturers who want to protect their reputations; they need to know what’s going to happen when people actually use their products. Plus, consumers demand this information: they want to understand if a product is going to be suitable for their particular use. While the cover of a marketing brochure may feature beautiful people, the “technical specifications” section is vital for making a buying decision. These “technical specifications” come from somewhere—testing!

While manufacturers are well versed in testing, I feel it’s something many consumers don’t consider. The product is unwrapped from its beautiful packaging. There it sits, not to be messed with, only to be used as intended. Maybe some of this reluctance is a fear of loss: you paid your hard-earned money for a new thingamajig and so the last thing you want to do is to harm it while testing its limits. However, if lives or livelihoods are dependent upon something, a higher standard of certainty is required.

Stirring Up Trouble

You might not be as brave as Netflix: it takes a lot of preparation and the right circumstances to make mischief in your production environment. However, there still are plenty of opportunities to know your systems by seeing them at their worst, albeit in safer ways. Consider making a test environment, with duplicates of the machines you use in production, so that you can muck around and learn without risking disruptions to your business. What happens if you shut off a particular subsystem or reintroduce a known broken part? Try it and observe the results.

One major caveat is to understand that test environments are, by definition, not the same as the real thing. Knowing when it’s appropriate to extrapolate your results, to make predictions about what will happen in real life, can be extremely difficult: history is littered with great ideas that tested well, but ultimately didn’t work. The only generic advice one can give is that tests should closely resemble the real conditions, in all possible dimensions: time, machines, usage, personnel, environment, and context. This list is a mirror of the one presented in “Duplicate The Problem.” Both testing and problem duplication share the same goal: matching the details in which problems originate, that original context of the production environment.

We did a lot of testing at Discovery Mining, mostly of the non-destructive variety. We also heavily vetted new equipment before putting it into service: for example, running a new server through its paces, perhaps for as long as a week, before trusting it in our infrastructure. Later on, we got our vendors to build our tests into their manufacturing processes. We’d order a new machine, and it would be tested and broken-in to our specifications before it was shipped to us.

Speaking of which, a break-in period is a great idea for new or recently repaired machines, especially if it can be done before a machine is deployed for real work. Statistically, a good number of failures will be clustered around those initial hours of usage. That’s where manufacturing defects, installation problems, and configuration errors first show themselves.

If you designed a machine yourself, or have cobbled together something in the style of a systems integrator, then it’s up to you to understand its limits. The software and hardware industries have long used a technique called “fault injection.” This is where errors are purposefully introduced to a system and the results are observed. This process is applicable to both the digital and analog worlds: whether it’s sending bad data to an API, shorting a connection on a circuit board, or loosening a belt in an engine, the principle is the same.
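
For the “sending bad data to an API” flavor of fault injection, a minimal Python sketch might look like the following. Everything here is hypothetical: `parse_order` stands in for whatever system you want to test, and the `FAULTS` list is just one way to organize deliberately broken inputs alongside a valid control case.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_order(payload: str) -> Dict[str, Any]:
    """Hypothetical system under test: parse an order and validate the quantity."""
    data = json.loads(payload)
    quantity = int(data["quantity"])
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return {"item": str(data["item"]), "quantity": quantity}


# Deliberately broken inputs: each one injects a different kind of fault.
FAULTS = [
    '{"item": "widget", "quantity": 3}',    # control case: valid input
    '{"item": "widget"}',                   # missing field
    '{"item": "widget", "quantity": -1}',   # out-of-range value
    '{"item": "widget", "quantity": "x"}',  # wrong type
    'not json at all',                      # malformed payload
]


def inject_faults(payloads: List[str]) -> List[Tuple[str, str]]:
    """Feed each payload to the system and record how it reacted."""
    results = []
    for payload in payloads:
        try:
            parse_order(payload)
            results.append((payload, "ok"))
        except Exception as exc:
            results.append((payload, type(exc).__name__))
    return results


for payload, outcome in inject_faults(FAULTS):
    print(f"{outcome:>16}  {payload}")
```

What you’re looking for in the results is whether each fault produces a reaction you can live with. A clear, specific error is a good sign; a silent “ok” on bad data or a crash you didn’t anticipate is exactly the kind of thing you want to discover before it matters.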

So, here’s to monkeying around! Go forth and create some controlled mischief.
