Wednesday, August 13, 2014

Chaos Monkey 2.0... or 0.5?

Netflix began moving its services to AWS back in late 2009 / early 2010. In 2010 they posted a very interesting article about their transition to the Amazon cloud. A lot of interesting problems cropped up. One of their responses is something that much of the IT community is now familiar with - Chaos Monkey. This service runs in the wild, randomly terminating running instances in production without remorse.

Why did Netflix decide to do this? They realized that these problems were going to happen, and happen constantly. They asked themselves a question: would they be better off waiting for a failure to happen and then running around like chickens with their heads cut off, or deliberately introducing failure, learning from it, and improving the architecture? The answer is unequivocally the latter. The regular introduction of failures, when accompanied by learning and improvement, drastically improves quality.

This was pretty revolutionary in the tech space. Why haven't we applied it more broadly in our organizations, for instance to organizational structures? Organizational agility is, in large part, about the decentralization of organizational structures. Check out Reuse Creates Bottlenecks for a great write-up on this topic. In The New Gold Standard, Joseph Michelli writes about how the Ritz-Carlton has taken decentralization and empowerment to impressive extremes. Each and every employee is empowered to spend up to $2,000 per day per guest to improve their stay. Do any of you work for very hierarchical organizations where the prevailing management style is command and control? What tools do we have at our disposal for measuring our organization's ability to push decision making down to the lowest level possible? Localized decision making is almost always faster, because the decision doesn't have to wait in line for someone further up the chain. This is the heart of true agility.

Netflix's Chaos Monkey is really a disaster audit. Why don't we introduce Chaos Monkey as an empowerment audit? Imagine a common scenario in a typical organization. You're getting ready to deploy a feature to production. As usual you e-mail all the managers whose approval is needed for a production deployment. You promptly receive out of office replies and realize that two of them are away on a business trip. What do you do? If you follow the process, you delay releasing valuable customer features. If you go ahead and deploy, you risk the command and control powers that be coming down on you in force.

Here's where the beauty of injecting this type of fault into the system comes into play. What if those managers weren't truly out of the office? What if there were a way to inject this type of chaos into our daily routine? Could we take all of the managers and senior leaders who are too involved in decisions that should be made at lower levels in the organization, and randomly sever their normal communication points? Email? Gone. Messaging? Gone. Desk phone? Gone. Company-issued cell? Gone. How long would it take for command and control organizations to screech to a halt? Would empowered teams hum along?
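For the technically inclined, here is a rough sketch of what such an empowerment audit could look like as a simulation. Everything in it is made up for illustration: the function name `empowerment_audit`, the `outage_fraction` parameter, and the idea of modeling each decision as the set of people whose sign-off it requires. It is a toy under those assumptions, not a real tool.

```python
import random


def empowerment_audit(decisions, leaders, outage_fraction=0.5, seed=None):
    """Randomly 'sever' some leaders and report which decisions would stall.

    decisions: dict mapping a decision name to the set of people whose
               approval it requires (hypothetical model, for illustration).
    leaders:   list of people who may be made unreachable.
    """
    rng = random.Random(seed)
    # Take out at least one leader, but never more than exist.
    count = min(len(leaders), max(1, round(len(leaders) * outage_fraction)))
    unreachable = set(rng.sample(leaders, count))

    # A decision is blocked if any required approver is unreachable.
    blocked = {
        name: sorted(required & unreachable)
        for name, required in decisions.items()
        if required & unreachable
    }
    return unreachable, blocked


if __name__ == "__main__":
    decisions = {
        "deploy feature to production": {"dev manager", "ops manager"},
        "fix typo on help page": set(),  # the team can decide this itself
    }
    gone, blocked = empowerment_audit(
        decisions, ["dev manager", "ops manager", "vp engineering"], seed=42
    )
    print("Unreachable today:", sorted(gone))
    for name, missing in blocked.items():
        print(f"BLOCKED: {name} (waiting on {', '.join(missing)})")
```

The fewer decisions that show up as blocked, the more decision making has actually been pushed down to the teams.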

Leaders who embrace this concept can go on vacation for a week, two weeks, and things keep on humming. They create a better work-life balance for both themselves and their employees. If your organization has made a conscious effort to change culture, empower employees, and improve products and services, this could be a way to measure progress. At the very least, leaders who are approached to make a decision should constantly ask, "Could this decision have been made at a lower level in the organization?"

This concept could even be applied to knowledge management, as a "silo audit" of sorts. Knowledge sharing within a team is critically important. Ever heard of the bus number concept? How many people on your team or in your org would have to get hit by a bus before progress would halt?* I would guess that for a lot of teams and organizations that number is one. What happens when that one person severs all communication ties for a period of time? Better to practice and simulate this before it really happens! Knowledge sharing, much like disaster recovery, must be automatic and continuous.
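If you want to put a number on it, a back-of-the-napkin version of the silo audit might look like the sketch below. It assumes you can write down, for each area of knowledge, the set of people who could handle it on their own. The function name `bus_number` and the sample data are invented for illustration. Under that simple model, the bus number is just the size of the smallest group covering any one area, since losing that whole group leaves the area with nobody.

```python
def bus_number(knowledge):
    """Return (bus_number, most_fragile_area) for a team.

    knowledge: dict mapping an area of knowledge to the set of people
               who can handle it alone (hypothetical model).
    """
    if not knowledge:
        raise ValueError("need at least one knowledge area")
    # The area covered by the fewest people is the first one to go dark.
    fragile = min(knowledge, key=lambda area: len(knowledge[area]))
    return len(knowledge[fragile]), fragile


if __name__ == "__main__":
    team = {
        "production deploys": {"Ana", "Raj"},
        "billing system": {"Ana"},
        "on-call runbooks": {"Ana", "Raj", "Li"},
    }
    number, area = bus_number(team)
    print(f"Bus number: {number} (weakest area: {area})")
```

In this made-up team the bus number is one, and the billing system is the silo to break up first.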

We have a long way to go in this space, but maybe applying some of the same technical practices to our organizations will help us improve. Or we could just buy a tool for that...



* The bus number concept is obviously hyperbole (in most cases?), but it isn't that different from job transitions. What happens when your bus #1 employee takes a different job? Maybe it's within your company and that's a slight relief. Or maybe they leave altogether. Organizational continuity planning is the knowledge and talent equivalent of disaster planning for critical IT systems.
