Tuesday, October 28, 2014

Agility, Technical Leadership and the so-called Talent Shortage

My exposure to the agile community has not been super broad, but from what I have seen people tend to talk about agile in two ways.  Or at least I perceive two distinct discussions.  There is the "process" side of agile.  This includes things like planning, communication, team norms, managing deliverables, prioritization, etc.  Then there's what I call the technical practices side of agile.  This includes practices like continuous integration, test-first development, automation, pairing, etc. - mostly those practices born out of extreme programming (XP).

I'm a wholehearted believer in these technical practices.  I believe it's these practices that form the foundation for agility.  You can do XP without the agile processes, but you can't be agile without XP.  Even doing XP in a pure waterfall world would yield huge productivity gains.  For that reason I mostly equate one's agility with one's strength in technical practices.  That's not to say there aren't significant gains to be had with process change.  You just can't get functionality into the hands of customers faster than you can build, test, and release it.  And the speed of your build, test, and release cycle is determined by how much of it you've automated.

A few years ago Andy Singleton posted Tech Leads Will Rule the World.  I've had it bookmarked ever since.  I liked what it had to say then.  I firmly believe it now.  Businesses are desperately fighting for agility as competition continues to increase and as software disrupts our world.  To achieve the kind of agility that is so critical now and for future business viability, technical practices could not be more important.

When it comes to adopting technical practices there is no one more important than tech leads.  These are the key influencers with the ability to set team norms, and most importantly, the ability to lead by example.  The only way to successful adoption is strong technical leadership.  The quickest way to peril is poor technical leadership.

But how do we all get there?  Especially with all this talk of the technical talent shortage.  Andrew Clay Shafer has taken a strong position on this topic.  He's well worth listening to.  Yes, we need to attract great talent.  But first and foremost we need to look internally and treasure the strong technical leadership that we have today.  Your tech leads have a profound impact on the culture of the teams with which they work.

Do you know where your true tech leads are today?  If you're not sure, look no further than your highest-performing teams (they'll have the best technical practices).  And your "tech leads", well, you know how to identify them now too.

Friday, October 24, 2014

DevOps is Blue Ocean

I am fortunate to have recently attended the inaugural DevOps Enterprise Summit (DOES).  The event brought together ~600 tech professionals from around the world.  It had an aspect to it that I really liked: it was not all about rockstar, industry-leading companies and professionals telling their stories.  Many presenters are change agents and leaders from larger companies with slower-to-change cultures.  This made for a unique and fantastic conference.

On my way to the conference I decided to read a book that was given to me over a year ago – one that I had not yet prioritized high enough to read.  I’m glad I adjusted my backlog.  The book is called Blue Ocean Strategy and was published in 2005, so I’m a bit behind.  The premise of the book is basically this:

Companies spend most of their time competing in “red oceans”.  Red oceans are where the blood is.  The market is very competitive, profit margins have been driven down, and in order to remain competitive you have to scale.  The book advocates looking for and executing blue ocean strategies.  Blue oceans are where there is a gap in the competition.  There aren’t the same competitive challenges, and therefore the opportunity for massive growth still exists.  This may seem obvious, but the implementation is often not.

It’s easiest to illustrate with an example.  The first one in the book is about Cirque du Soleil.  Traditional circuses compete on a variety of characteristics that include (I’m sure I’m missing some):  price, use of animals, venue, aisle concessions, and number of simultaneous acts.  Price is low, animals are used frequently, venue is unimportant, aisle concessions help with revenue, and multiple simultaneous acts are considered important.  Cirque took these characteristics, eliminated some, reduced some, increased others, and added new ones, creating a whole new model.  They positioned their offering as a theater-going experience, allowing for a higher price.  They were able to do this by changing the venue, making it a higher-end experience.  To align with the environment change they ditched the aisle concessions, but the revenue loss was more than offset by cost reductions.  Animal use in the circus makes some people uncomfortable anyway, and is often the most expensive part of the traditional circus.  They eliminated this altogether.  And rather than put on multiple costly acts at once, which tends to overstimulate the audience and increase costs, they stuck to one.

You may have observed that with the changes Cirque made they actually increased revenue and reduced costs at the same time.  This is a primary goal in Blue Ocean Strategy.  And something we can and should endeavor to do outside of pure business ventures.  We should seek out win-win opportunities in the way we do our work too.

Typical infrastructure / operations management, over time, becomes a liability in most companies.  There is no better way to understand this than to read The Phoenix Project.  Or to ask someone who has worked in the pure Ops space (i.e. no DevOps).  At the conference Gene Kim shared the remarks of a colleague:

Support work is through the roof.  There is little (or no) time for long-term value-add work.  And the quality of life is terrible.  So from a business perspective you have high costs, low throughput, and low employee engagement. 

DevOps is the blue ocean.  By collaborating and applying engineering practices to infrastructure management, we can simultaneously achieve low costs, high throughput, and high employee engagement.  How often do businesses find opportunities like this?

So let’s get moving in the right direction.  Many companies are well on their way.  I was blown away to discover that one area within the Department of Homeland Security is deploying to the cloud with solid DevOps practices.  There is some great starter information in the 2014 State of DevOps Report, including business justifications and practices that result in better business outcomes.

Wednesday, August 13, 2014

Chaos Monkey 2.0... or 0.5?

Netflix began transitioning to host their services on AWS back in late 2009 / early 2010.  In 2010 they posted a very interesting article about their transition to the Amazon cloud.  A lot of interesting problems cropped up.  What they decided to do is something that much of the IT community is now familiar with - Chaos Monkey.  This service runs in the wild, randomly bringing down entire chunks of infrastructure without remorse.

Why did Netflix decide to do this?  They realized that these problems were going to happen, and happen constantly.  They asked themselves whether they would be better off waiting for a failure to happen and then running around like chickens with their heads cut off, or whether deliberately introducing failure, learning from it, and improving the architecture would be a better approach.  The answer is unequivocally the latter.  The regular introduction of failures, when accompanied by learning and improvement, drastically improves quality.
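To make the idea concrete, here's a minimal sketch of what a Chaos-Monkey-style script could look like.  This is purely illustrative and not Netflix's actual implementation (theirs is open source and far more sophisticated); it assumes boto3 is installed, AWS credentials are configured, and that teams opt their instances in via a made-up "chaos-opt-in" tag.

```python
# Minimal, hypothetical Chaos-Monkey-style script: pick one opted-in,
# running EC2 instance at random and terminate it.
import random
import boto3

ec2 = boto3.client("ec2")

def find_candidates(tag_key="chaos-opt-in", tag_value="true"):
    """Return IDs of running instances that have opted in to chaos testing."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]

def unleash_the_monkey():
    candidates = find_candidates()
    if not candidates:
        print("No opted-in instances found; the monkey goes hungry.")
        return
    victim = random.choice(candidates)
    print(f"Terminating {victim} - the system should recover on its own.")
    ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    unleash_the_monkey()
```

Run something like this on a schedule and failures stop being surprises; they become routine inputs to learning and architectural improvement.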

This was pretty revolutionary in the tech space.  Why haven't we applied this more broadly in our organizations?  For instance, to organizational structures?  Organizational agility is, in large part, about the decentralization of organizational structures.  Check out Reuse Creates Bottlenecks for a great write-up on this topic.  In The New Gold Standard, Joseph Michelli writes about how the Ritz-Carlton has taken decentralization and empowerment to impressive extremes.  Each and every employee is empowered to spend up to $2000 per day per guest to improve their stay.  Do any of you work for very hierarchical organizations where the prevailing management style is command and control?  What tools do we have at our disposal for measuring our organization's ability to push decision making down to the lowest level possible?  Localized decision making is always significantly faster.  This is the heart of true agility.

Netflix's chaos monkey is really a disaster audit.  Why don't we introduce Chaos Monkey as an empowerment audit?  Imagine a common scenario in a typical organization.  You're getting ready to deploy a feature to production.  As usual, you e-mail all the managers whose approval is needed for the production deployment, only to promptly receive out-of-office replies and realize that two of them are traveling on a business trip.  What do you do?  If you follow the process you delay releasing valuable customer features.  If you go ahead and deploy you risk the command-and-control powers that be coming down on you in force.

Here's where the beauty of injecting these types of faults into the system comes into play.  What if those managers weren't truly out of the office?  What if there was a way to inject this type of chaos into our daily routine?  Could we take all of the managers and senior leaders who are too involved in decisions that should be made at lower levels in the organization, and randomly sever their normal communication points?  Email?  Gone.  Messaging?  Gone.  Desk phone?  Gone.  Company-issued cell?  Gone.  How long would it take for command-and-control organizations to screech to a halt?  Would empowered teams hum along?

Leaders who embrace this concept can go on vacation for a week, or even two, and things keep on humming.  They create a better work-life balance for both themselves and their employees.  If your organization has made a conscious effort to change culture, empower employees, and improve products and services, this could be a way to measure progress.  At the very least, leaders approached to make a decision must constantly ask, "Could this decision have been made at a lower level in the organization?"

This concept could even be applied to knowledge management, as a "silo audit" of sorts.  Knowledge sharing within a team is critically important.  Ever heard of the bus number concept?  How many people on your team or in your org would have to get hit by a bus before progress would halt?*  I would guess that for a lot of teams and organizations that number is one.  What happens when that one person severs all communication ties for a period of time?  Better to practice and simulate this before it really happens!  Knowledge sharing, much like disaster recovery, must be automatic and continuous.

We have a long way to go in this space, but maybe applying some of the same technical practices to our organizations will help us improve.  Or we could just buy a tool for that.....



* The bus number concept is obviously hyperbole (in most cases?), but this isn't that different from job transitions.  What happens when your bus #1 employee takes a different job?  Maybe it's within your company and that's a slight relief.  Or maybe they leave altogether.  Organizational continuity planning is the knowledge and talent equivalent of disaster planning for critical IT systems.

Economies of Scale

Let's walk through the typical software development cycle for a new product.  Needs are identified, a team is put together, and functionality is built and deployed.  As business needs evolve, new features are developed, existing features are changed, and in general systems usually grow in size.  As the size of the system grows, testing becomes a bottleneck for most companies.  In fact, companies that let this go on for too long end up with testing cycles that dwarf the development portion of the delivery process.  Why?  Testing needs to cover new features as well as existing features, so testing effort is theoretically equal to all past testing efforts plus the effort required to test whatever is new.  If manual, this obviously gets cumbersome quickly.
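To put rough numbers on that growth, here's a back-of-the-envelope sketch.  The figures are invented purely for illustration.

```python
# Toy model: each release adds features, and a manual regression pass must
# re-cover every existing feature plus the new ones. Numbers are made up.
features_per_release = 10
hours_per_feature = 2  # manual effort to regression-test one feature

total_features = 0
for release in range(1, 9):
    total_features += features_per_release
    manual_hours = total_features * hours_per_feature
    print(f"Release {release}: {total_features} features, "
          f"~{manual_hours} hours of manual regression testing")

# By release 8 a single regression pass costs 160 hours (roughly a
# person-month), while an automated suite re-runs in minutes.
```

The per-release cost grows linearly, so the cumulative testing bill grows quadratically, which is exactly why the testing cycle eventually dwarfs development.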

Enter test automation.  The argument becomes clear to organizations that more of the testing process must be automated.  It's a process that is repeated over and over again.  Why would a company pay a tester to perform the exact same task over and over again?  The logic is straightforward.  It seems to me that this is the reason that test automation practices have gained a much larger foothold in companies over the last several years.

The Infrastructure as Code movement is much newer, however.  It seems to me that the argument for test automation (above) is not as applicable in the ops space.  If you consider the same product development cycle described above, but from an ops perspective, it might go something like this:  needs are identified, teams request new infrastructure to host the new solution(s), infrastructure is provisioned, and the new resources are folded into existing support processes.

"Infrastructure is provisioned".  Once.  Granted, in the case of Amazon, Netflix, Facebook, etc, this wouldn't hold true.  Everything they build is scaled massively.  That's not the case for a lot of companies, especially those in the Information Technology Dark Side.  Infrastructure is provisioned once and that's it.*  There is no economy of scale.

Now there is still an argument; it's just an argument that I think is harder to make and back up with concrete businessy goodness.  Testing?  Manual regression testing takes 1 month.  Our automation suite will take 4 minutes.  Boom.  Infrastructure?  Welllll, the automation is a repeatable, reliable process that will significantly improve our confidence in the provisioning and subsequent change process.  "That's all well and good, but you're telling me we need to write all this... what did you call it... infrastructure codeage?  That seems like a lot of work to stand up these 2 servers.  We'll just hammer them out."

I wholeheartedly believe that the real value in infrastructure as code + automation is the actual repeatable, reliable process that results in consistency.  The extremely beneficial by-product is that the result can scale across as many nodes as you want.  To me this is akin to the TDD argument.  A lot of very smart people pointed out that the real value in TDD is the improved code design.  The fact that the resulting tests form a regression suite is just icing on the cake.  But neither of those real reasons is an easy argument to make to management.
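For what it's worth, here's a toy sketch of what that repeatable, reliable process can look like in code.  The spec format and function names are entirely hypothetical (a real shop would reach for a tool like Puppet, Chef, or CFEngine), but the idea is the same: the desired state lives in version control, and an idempotent routine converges the environment toward it.

```python
# Hypothetical illustration of infrastructure as code: the desired servers
# are described as version-controlled data, and a small idempotent routine
# computes the actions needed to make reality match the description.
DESIRED_SERVERS = [
    {"name": "web-01", "packages": ["nginx"], "port": 80},
    {"name": "web-02", "packages": ["nginx"], "port": 80},
]

def current_servers():
    """Stand-in for querying what actually exists (inventory, CMDB, cloud API)."""
    return {"web-01": {"packages": ["nginx"], "port": 80}}

def converge(desired, actual):
    """Return the actions needed to make the environment match the spec.

    Running this twice in a row yields no new actions, which is exactly the
    repeatable, reliable property that manual provisioning lacks.
    """
    actions = []
    for server in desired:
        if server["name"] not in actual:
            actions.append(
                f"provision {server['name']} with {', '.join(server['packages'])}"
            )
        # A real tool would also diff installed packages, config files, ports, etc.
    return actions

if __name__ == "__main__":
    for action in converge(DESIRED_SERVERS, current_servers()):
        print(action)
```

Run it once and web-02 gets provisioned; run it again and nothing happens.  That consistency is the pitch, even when there are only two servers and no scale in sight.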

Cost-focused organizations and managers want economies of scale.  They affect the bottom line.  It seems to me that test automation is an easier argument to make.  What am I missing?  What's the business case for infrastructure as code / automation?  How do you frame it up in a way that connects to concrete business value?



* You could certainly make the argument that subsequent changes to the infrastructure should be vetted through an automated test suite process similar to the one I described for application code.  That's fair.  And I'm sure people do it.  That just feels even less tangible to me right now.