Wednesday, August 13, 2014

Chaos Monkey 2.0... or 0.5?

Netflix began transitioning their services to AWS back in late 2009 / early 2010.  In 2010 they posted a very interesting article about their transition to the Amazon cloud, and a lot of thorny problems cropped up along the way.  What they decided to do in response is something that much of the IT community is now familiar with: Chaos Monkey.  This service runs in the wild, randomly bringing down entire chunks of infrastructure without remorse.
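For a sense of how little machinery the core idea needs, here's a minimal sketch of a chaos-monkey-style job, assuming Python with boto3 against EC2 and a hypothetical "chaos-opt-in" tag that services use to volunteer.  The real Chaos Monkey (and the rest of Netflix's Simian Army) is far more configurable, but the heart of it is just "pick something at random and kill it."

import random

import boto3
from botocore.exceptions import ClientError

# Minimal chaos-monkey-style sketch: find instances that have opted in,
# pick one at random, and terminate it.  The "chaos-opt-in" tag and the
# region are hypothetical stand-ins for real configuration.
ec2 = boto3.client("ec2", region_name="us-east-1")


def candidate_instances():
    """Return the IDs of running instances tagged as fair game."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def unleash_the_monkey(dry_run=True):
    """Terminate one randomly chosen candidate instance."""
    candidates = candidate_instances()
    if not candidates:
        print("No opted-in instances found; the monkey goes hungry.")
        return
    victim = random.choice(candidates)
    print("Terminating " + victim)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS reports "this would have succeeded" as an error.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise


if __name__ == "__main__":
    unleash_the_monkey(dry_run=True)  # flip to False only when you mean it

Run something like that on a schedule during business hours and you have the essence of the idea: failure stops being a surprise and becomes a routine input to the system.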

Why did Netflix decide to do this?  They realized that these failures were going to happen, and happen constantly.  They asked themselves whether they would be better off waiting for a failure to happen and then running around like chickens with their heads cut off, or deliberately introducing failure, learning from it, and improving the architecture.  The answer is unequivocally the latter.  The regular introduction of failures, when accompanied by learning and improvement, drastically improves quality.

This was pretty revolutionary in the tech space.  Why haven't we applied it more broadly in our organizations?  To organizational structures themselves, for instance?  Organizational agility is, in large part, about the decentralization of organizational structures.  Check out Reuse Creates Bottlenecks for a great write-up on this topic.  In The New Gold Standard, Joseph Michelli writes about how the Ritz-Carlton has taken decentralization and empowerment to impressive extremes: each and every employee is empowered to spend up to $2,000 per day per guest to improve their stay.  Do any of you work for very hierarchical organizations where the prevailing management style is command and control?  What tools do we have at our disposal for measuring our organization's ability to push decision making down to the lowest level possible?  Localized decision making is always significantly faster.  This is the heart of true agility.

Netflix's Chaos Monkey is really a disaster audit.  Why don't we introduce a Chaos Monkey as an empowerment audit?  Imagine a common scenario in a typical organization.  You're getting ready to deploy a feature to production.  As usual, you e-mail all the managers whose approval is required for a production deployment, only to receive out-of-office replies and realize that two of them are away on a business trip.  What do you do?  If you follow the process, you delay releasing valuable features to customers.  If you go ahead and deploy, you risk the command-and-control powers that be coming down on you in force.

Here's where the beauty of injecting this type of fault into the system comes into play.  What if those managers weren't truly out of the office?  What if there was a way to inject this type of chaos into our daily routine?  Could we take all of the managers and senior leaders who are too involved in decisions that should be made at lower levels in the organization, and randomly sever their normal communication channels?  Email?  Gone.  Messaging?  Gone.  Desk phone?  Gone.  Company-issued cell?  Gone.  How long would it take for command-and-control organizations to screech to a halt?  Would empowered teams hum along?

Leaders who embrace this concept can go on vacation for a week, even two, and things keep on humming.  They create a better work-life balance for both themselves and their employees.  If your organization has made a conscious effort to change culture, empower employees, and improve products and services, this could be a way to measure progress.  At the very least, leaders approached to make a decision must constantly ask, "Could this decision have been made at a lower level in the organization?"

This concept could even be applied to knowledge management, as a "silo audit" of sorts.  Knowledge sharing within a team is critically important.  Ever heard of the bus number concept?  How many people on your team or in your org would have to get hit by a bus before progress halted?*  I would guess that for a lot of teams and organizations that number is one.  What happens when that one person severs all communication ties for a period of time?  Better to practice and simulate this before it happens for real!  Knowledge sharing, much like disaster recovery, must be automatic and continuous.

We have a long way to go in this space, but maybe applying some of the same technical practices to our organizations will help us improve.  Or we could just buy a tool for that.....



* The bus number concept is obviously hyperbole (in most cases?), but it isn't that different from job transitions.  What happens when your bus #1 employee takes a different job?  Maybe it's within your company and that's a slight relief.  Or maybe they leave altogether.  Organizational continuity planning is the knowledge and talent equivalent of disaster planning for critical IT systems.

Economies of Scale

Let's walk through the typical software development cycle for a new product.  Needs are identified, a team is put together, functionality is built, and the product is deployed.  As business needs evolve, new features are developed, existing features are changed, and systems generally grow in size.  As the size of the system grows, testing becomes a bottleneck for most companies.  In fact, companies that let this go on for too long end up with testing cycles that dwarf the development portion of the delivery process.  Why?  Testing needs to cover new features as well as existing features, so testing effort is, in theory, equal to all past testing effort plus the effort required to test whatever is new.  If done manually, this obviously gets cumbersome quickly.
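To make the arithmetic concrete, here's a toy back-of-the-envelope model with entirely made-up numbers: if every release adds test cases and the old features still have to be re-checked, the manual regression bill grows with every release even though the new work per release stays flat.

# Toy model with made-up numbers: cumulative manual regression effort.
NEW_CASES_PER_RELEASE = 50     # hypothetical test cases added each release
MINUTES_PER_MANUAL_CASE = 10   # hypothetical time to execute one case by hand

for release in range(1, 11):
    total_cases = release * NEW_CASES_PER_RELEASE
    hours = total_cases * MINUTES_PER_MANUAL_CASE / 60
    print("Release %2d: %3d cases, about %5.1f hours of manual regression"
          % (release, total_cases, hours))

By release 10 the team in this toy example is spending over 80 hours per cycle just re-checking things that already worked.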

Enter test automation.  The argument quickly becomes clear to organizations: more of the testing process must be automated.  It's a process that is repeated over and over again, so why would a company pay a tester to perform the exact same task over and over again?  The logic is straightforward.  It seems to me that this is the reason test automation practices have gained a much larger foothold in companies over the last several years.
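As a trivially small illustration, here's what capturing one of those repeated manual checks as a script might look like; the pricing function is a hypothetical stand-in for whatever production logic a tester would otherwise re-verify by hand every release.

# test_pricing.py -- a repeated manual check ("does the order total still
# come out right?") captured as a script.  Run with pytest; the pricing
# function is a hypothetical stand-in for real production code.

def order_total(unit_price, quantity, tax_rate):
    """Stand-in for the production pricing logic under test."""
    return round(unit_price * quantity * (1 + tax_rate), 2)


def test_total_includes_tax():
    assert order_total(10.00, 3, 0.08) == 32.40


def test_zero_quantity_costs_nothing():
    assert order_total(10.00, 0, 0.08) == 0.00

The check that used to cost a tester a few minutes every release now costs milliseconds on every build, and it never gets bored or skips a step.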

The Infrastructure as Code movement is much newer, however.  It seems to me that the argument for test automation (above) is not as applicable in the ops space.  If you consider the same product development cycle described above from an ops perspective, it might go something like this: needs are identified, teams request new infrastructure to host the new solution(s), infrastructure is provisioned, and the new resources are folded into existing support processes.

"Infrastructure is provisioned".  Once.  Granted, in the case of Amazon, Netflix, Facebook, etc, this wouldn't hold true.  Everything they build is scaled massively.  That's not the case for a lot of companies, especially those in the Information Technology Dark Side.  Infrastructure is provisioned once and that's it.*  There is no economy of scale.

Now there is still an argument; it's just an argument that I think is harder to make and back up with concrete businessy goodness.  Testing?  Manual regression testing takes 1 month; our automation suite will take 4 minutes.  Boom.  Infrastructure?  Welllll, the automation is a repeatable, reliable process that will significantly improve our confidence in provisioning and in the subsequent change process.  "That's all well and good, but you're telling me we need to write all this... what did you call it... infrastructure codeage?  That seems like a lot of work to stand up these 2 servers.  We'll just hammer them out."
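For what it's worth, here's a minimal sketch of what that "codeage" might look like for those 2 servers, assuming Python with boto3 against EC2 and entirely hypothetical AMI, instance type, and tag values; real tooling (CloudFormation, Chef, Puppet, and friends) does this far more robustly, but the shape is the same: declare the servers you want, then converge reality toward the declaration.

import boto3

# Declarative sketch: the desired servers, keyed by Name tag.  AMI IDs,
# instance types, and names are hypothetical placeholders.
DESIRED_SERVERS = {
    "web-01": {"ImageId": "ami-12345678", "InstanceType": "t2.micro"},
    "web-02": {"ImageId": "ami-12345678", "InstanceType": "t2.micro"},
}

ec2 = boto3.resource("ec2", region_name="us-east-1")


def existing_named_instances():
    """Map Name tag -> instance for every non-terminated instance."""
    instances = ec2.instances.filter(
        Filters=[{"Name": "instance-state-name",
                  "Values": ["pending", "running", "stopped"]}]
    )
    found = {}
    for instance in instances:
        for tag in instance.tags or []:
            if tag["Key"] == "Name":
                found[tag["Value"]] = instance
    return found


def converge():
    """Create anything that's declared but missing; leave the rest alone."""
    existing = existing_named_instances()
    for name, spec in DESIRED_SERVERS.items():
        if name in existing:
            print(name + ": already provisioned, nothing to do")
            continue
        print(name + ": creating")
        created = ec2.create_instances(
            ImageId=spec["ImageId"],
            InstanceType=spec["InstanceType"],
            MinCount=1,
            MaxCount=1,
        )
        created[0].create_tags(Tags=[{"Key": "Name", "Value": name}])


if __name__ == "__main__":
    converge()

Run it once and you get your 2 servers.  Run it again and nothing happens, which is exactly the point: the process is repeatable and consistent, whether it's 2 servers or 200.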

I wholeheartedly believe that the real value in infrastructure as code + automation is the actual repeatable, reliable process that results in consistency.  The extremely beneficial by-product is that the result can scale across as many nodes as you want.  To me this is akin to the TDD argument.  A lot of very smart people pointed out that the real value in TDD is the improved code design.  The fact that the resulting tests form a regression suite is just icing on the cake.  But neither of those real reasons is an easy argument to make to management.

Cost-focused organizations and managers want economies of scale, because economies of scale affect the bottom line.  It seems to me that test automation is an easier argument to make.  What am I missing?  What's the business case for infrastructure as code / automation?  How do you frame it up in a way that connects to concrete business value?



* You could certainly make the argument that subsequent changes to the infrastructure should be vetted through an automated test suite process similar to the one I described for application code.  That's fair.  And I'm sure people do it.  That just feels even less tangible to me right now.