Friday, January 15, 2016

AWS Advice - Network Security

Much of the security of your AWS implementation is foundational.  Decisions that you make early on have the potential to impact the architecture of your system for a long time, often longer than you think.  And because these factors are so foundational, changing them has more impact, often in the form of downtime or a more complex migration.  It's important to carefully consider these up front and make your implementation as good as you can for your use case, but know that it will need to change at some point.

Automate Everything


This is an easy decision to make, but a hard one to honor.  It's vitally important to drive your infrastructure definitions through version control for traceability and to provision them only from those versioned changes for reproducibility.  Traceability becomes critical for audit and compliance.  Reproducibility is the bedrock for delivering high quality solutions.  Environments must be consistent to test potential release candidates and improve the system through testing cycles.  Automation also has the positive side effect of being quicker and easier to run.  In the cloud this means that resources can be built and torn down at will.  Treating your resources in this way - as cattle, not pets - can yield cost savings, scalability, improved security, and improved system testability.

VPC Network ACL


The VPC is the AWS resource that represents an isolated network.  When deciding how to scope your VPC(s) it's important to consider your options for network level security.  You'll have a stateless network ACL at the VPC level, giving you the ability to allow and deny CIDR ranges for all subnets to which the network ACL is attached.  Rules on network ACLs should be broad security definitions, and AWS publishes recommended ACL rules for common VPC scenarios.  For instance, if you expect internet traffic into your VPC you may choose to allow port 443 globally.  You will likely need to create rules for intra-VPC, inter-subnet traffic too, as the network ACL operates at the subnet level.  That is, traffic leaving subnet-A destined for subnet-B will be filtered by the attached network ACL, even though both subnets are in the same VPC.  These rules can be broader - possibly port ranges for microservices - but should be scoped to your VPC CIDR range.  An important sidenote - since network ACLs are stateless they are unaware of connections sourced within your VPC.  You'll have to craft inbound rules that allow responses to traffic generated by your applications and services (and vice versa).  Overall, when building network ACL rules think broad, environment-wide rules.
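To make the statelessness concrete, here's a minimal sketch (not AWS's published rules - the rule numbers, ports, and CIDRs are invented for illustration) of a web-facing VPC's NACL expressed as plain data.  Note how every allow needs a companion rule for response traffic on the ephemeral port range:

```python
def build_nacl_rules(vpc_cidr="10.0.0.0/16"):
    """Return stateless (inbound, outbound) rule lists for a web-facing VPC.

    Because NACLs are stateless, every inbound allow needs a matching
    outbound rule for response traffic (and vice versa).  The ephemeral
    range 1024-65535 covers responses to connections our servers initiate.
    """
    inbound = [
        {"rule": 100, "protocol": "tcp", "port_range": (443, 443),
         "cidr": "0.0.0.0/0", "action": "allow"},     # internet HTTPS in
        {"rule": 200, "protocol": "tcp", "port_range": (1024, 65535),
         "cidr": "0.0.0.0/0", "action": "allow"},     # responses to our outbound calls
        {"rule": 300, "protocol": "tcp", "port_range": (8000, 8999),
         "cidr": vpc_cidr, "action": "allow"},        # intra-VPC microservice range
    ]
    outbound = [
        {"rule": 100, "protocol": "tcp", "port_range": (1024, 65535),
         "cidr": "0.0.0.0/0", "action": "allow"},     # responses to inbound 443
        {"rule": 200, "protocol": "tcp", "port_range": (443, 443),
         "cidr": "0.0.0.0/0", "action": "allow"},     # our own outbound HTTPS calls
        {"rule": 300, "protocol": "tcp", "port_range": (8000, 8999),
         "cidr": vpc_cidr, "action": "allow"},        # intra-VPC microservice range
    ]
    return inbound, outbound
```

Notice the rules stay broad (a whole port range scoped to the VPC CIDR) rather than per-service - that's the security group's job.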

Security Groups


The network filtering complement to the VPC network ACL is the security group.  Think of security groups as single purpose firewalls.  These are stateful and operate most effectively at the resource level rather than the overall network level.  Here are some important notes on security groups.

  • Stateful - no need to worry about rules for response traffic. They're aware of active connections and allow accordingly. 
  • Can be attached to a number of resources including EC2 instances, load balancers, and RDS instances, just to name a few.
  • Take advantage of the fact that these can be applied granularly.  You can get a win-win in the form of security and documentation if you apply rules at the most granular level (i.e. the service or application).  You'll only be allowing what's absolutely necessary (security++).  And you will now have documented(++) the fact that X service requires this/these port(s).  If you want to take this a step further you can even write a test suite that checks your security groups, asserting these granular rules.  The unfortunate reality is that it's extremely easy to rely on too few security groups, applying broad rules across multiple servers / systems.  You end up with a confusing, coupled security group spiderweb that's extremely challenging to untangle.  Check out aws-security-viz to see a visualization of your security groups.  Very enlightening.
  • Limits - there are limits on the number of rules per security group, how many security groups per VPC, and how many security groups a resource can have.  The limits change over time, so start with the published VPC service limits.
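The test suite idea above can be sketched very simply.  This is a hypothetical example - the group name, port, and rule shape are invented, and a real suite would pull live rules via the EC2 API rather than hardcoding them - but it shows how an assertion doubles as documentation:

```python
def open_ports(security_group):
    """Flatten a security group's ingress rules into the set of allowed ports."""
    ports = set()
    for rule in security_group["ingress"]:
        ports.update(range(rule["from_port"], rule["to_port"] + 1))
    return ports

# In practice this dict would be fetched from the EC2 API at test time.
app_sg = {
    "name": "orders-service",
    "ingress": [{"from_port": 8080, "to_port": 8080, "source": "sg-loadbalancer"}],
}

# The assertion is the documentation: orders-service needs 8080, and nothing else.
assert open_ports(app_sg) == {8080}
```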

Subnet Accessibility


Most companies design their network with at least an internal network and an externally-accessible network, often called a DMZ.  You can achieve this same effect in at least two different ways with the VPC.

Your first option is what I'll call the all-in-one subnet option.  Create one subnet per AZ, as usual, and logically divide internal and internet-facing servers by only allocating a public IP address to internet-facing servers.  Servers that only get a private IP address will not be accessible from the internet, achieving the typical internal subnet effect.  Because public and private servers are colocated in the same subnet, your network ACL(s) will have rules for internet traffic and inter-subnet traffic.  This isn't necessarily bad, just something to be aware of.  And you'll of course have security groups wrapping your servers with more specific rules.  With this approach you will need to decide whether your subnets default to assigning a public IP address or not.  My preference is to avoid mistakes: default to private, and require deployed servers to explicitly request a public IP.  Now, the most significant downside to this approach is that you'll likely want at least some of your private servers to still access the internet for something - hitting APIs, downloading packages, etc.  Internet-accessible servers require an internet gateway, whereas private servers require a NAT to access the internet.  The route table destination for both of these would normally be 0.0.0.0/0.  So, unless you're willing to identify the destination CIDRs for the private servers, you'll have to go with the second option.

The second option is what I'll call the DMZ subnet option. In this scenario you're not mixing internet-facing and internal servers in the same subnet.  You'll have public and private subnets.  Here are the basic steps:

  • Create two subnets per AZ.  For the internal ones make them default to not assigning a public IP address.  For the internet-facing ones (DMZ) make them default to assigning a public IP address.
  • Your internal subnets will likely require larger CIDR blocks.  The DMZ subnets should only really host your NAT Gateway and your internet-facing load balancers.  Unless you have a ton of ELBs, these subnets can likely be smaller, although there's no downside (other than available IPs) to making them large.
  • Build two network ACLs.  One will be attached to the DMZ subnets.  For that one you should limit the traffic, ideally, to HTTPS (443) only.  If you can't, grant the minimum.  Remember, it's stateless, so you'll have to add rules for response traffic and traffic to / from your internal subnets.  The second ACL will be attached to the internal subnets.  This should be more restrictive, likely without any 0.0.0.0/0 rules.
  • Create a NAT Gateway in your public (DMZ) subnets (one per AZ for availability).  This is how your internal servers will still be able to reach outbound to the internet.
  • Create an Internet Gateway.
  • Create two route tables.  The internal route table should have a 0.0.0.0/0 rule pointing to the NAT gateway you just created.  The DMZ route table should have 0.0.0.0/0 pointing to the Internet Gateway.
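The route table step is where the two subnet tiers actually diverge, so here's a minimal sketch of it as plain data (the gateway IDs are invented placeholders; in practice you'd build these via CloudFormation, Terraform, or the EC2 API, per the "automate everything" advice above):

```python
def build_route_tables(nat_gateway_id, internet_gateway_id):
    """Return the (internal, dmz) route tables from the steps above.

    Both tables send 0.0.0.0/0 somewhere - the difference is the target:
    internal subnets go out through the NAT Gateway (one-way, outbound
    only), DMZ subnets go straight to the Internet Gateway.
    """
    internal = {"routes": [{"destination": "0.0.0.0/0", "target": nat_gateway_id}]}
    dmz = {"routes": [{"destination": "0.0.0.0/0", "target": internet_gateway_id}]}
    return internal, dmz

internal, dmz = build_route_tables("nat-0abc", "igw-0def")
```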

An important caveat on security group limits

Most of these limits are in place to ensure that AWS can guarantee service levels.  Obviously, the more rules that must be evaluated to make a network filtering decision, the more time and resources required.  Behind the scenes your security group rules evaluate to jump rules.  If you specify a CIDR range in a rule, this evaluates to one jump rule - a direct IP comparison can be done on the traffic.  If you reference another security group from which traffic should be allowed, this can result in a large number of jump rules.  Essentially, AWS resolves the referenced security group to the set of resources it's attached to.  If the referenced security group is attached to 10 other AWS resources, then 10 rules are created behind the scenes - one per attached resource - so that the underlying firewall can do the proper IP comparisons.  In essence, referencing security groups is very convenient, but suboptimal when applied broadly.
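The arithmetic is worth making explicit.  A rough back-of-the-envelope estimator (my own sketch of the expansion described above, not an AWS tool):

```python
def jump_rule_count(rules, sg_attachments):
    """Estimate the underlying jump rules for a security group's rule list.

    A CIDR rule evaluates to a single jump rule; a rule referencing
    another security group expands to one jump rule per resource that
    group is attached to.
    """
    total = 0
    for rule in rules:
        if "cidr" in rule:
            total += 1
        else:
            total += len(sg_attachments[rule["source_sg"]])
    return total

rules = [
    {"cidr": "10.0.0.0/16"},     # one jump rule, regardless of fleet size
    {"source_sg": "sg-shared"},  # expands per attached resource
]
attachments = {"sg-shared": ["i-%02d" % n for n in range(10)]}
# 1 CIDR rule + 10 attachments = 11 jump rules
```

Grow that shared group to a few hundred instances and the single SG-reference rule dwarfs everything else - which is exactly why broad references get you limited.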

I recommend specifying CIDR ranges wherever possible.  It does, however, make perfect sense to reference security groups within an individual deployment stack.  For instance, if you have a typical stack with a load balancer, app server, and database, you can give each its own security group.  The load balancer might allow port 443 for HTTPS traffic and forward to port 8080 on the app server.  The server security group can allow port 8080, but reference the load balancer security group, effectively allowing traffic only from the load balancer.  Similarly, the database might allow traffic on port 3306, but reference the server security group, allowing traffic only from the app server.
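That three-tier chain can be written down as data, which also makes it checkable.  A sketch with invented group names:

```python
# Each tier's ingress only names the tier directly in front of it.
lb_sg  = {"ingress": [{"port": 443,  "source": "0.0.0.0/0"}]}   # internet -> LB
app_sg = {"ingress": [{"port": 8080, "source": "lb_sg"}]}       # only the LB
db_sg  = {"ingress": [{"port": 3306, "source": "app_sg"}]}      # only the app

def allows(sg, port, source):
    """True if this security group admits `source` on `port`."""
    return any(r["port"] == port and r["source"] == source
               for r in sg["ingress"])

# The database is unreachable from the internet, even on its own port.
assert allows(app_sg, 8080, "lb_sg")
assert not allows(db_sg, 3306, "0.0.0.0/0")
```

Each reference here resolves to a handful of resources at most, so the jump-rule cost stays trivially low.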

If you craft rules that result in a large number of jump rules you will likely get restricted to a maximum of 100 security groups in your VPC.  Rearchitecting out of this design can be a significant undertaking.  I highly recommend that you instead design your AWS environment to have smaller VPCs in which you expect deployed applications / services to communicate with one another.  This will allow you to mostly rely on CIDR ranges.  If you instead deploy disparate applications to the same VPC and expect to filter their network traffic you'll be forced to create logical "environments" within your VPC using security groups.  This is where the jump rules start, and proliferate.  At a minimum, plan to deploy separate VPCs for each of your environments - test, prod, etc.  That way you can prevent test services from talking to production services with CIDR-based rules.  Then you can deploy individual stacks with security group references like in the example above.  You will have more security groups, but AWS will approve an increase from the initial 100 limit because your total number of jump rules will be very low.

What else?

There are a lot of considerations when you design your AWS environment.  In upcoming posts I'll talk about:

  • S3
  • IAM - users, groups, roles, and policies
  • Logging

Hopefully this helps as you think about your VPC design.  What other considerations / implications have you come across that impacted your design choices?  Did you discover anything after the fact that caused a significant redesign?  I'd love to hear from you. 

Sunday, July 26, 2015

Jurassic Delivery

Who didn't love when Jurassic Park hit in 1993?  Hot damn if I didn't own that on VHS as soon as it came out.  Velociraptors. T-Rexes. 3-D GUI Unix operating systems.  It was the shit.  There have been some downturns, but when Jurassic World came out with Chris Pratt in a leading role I couldn't pass it up.  Clearly things had been taken to the next level and I needed to submit and enjoy.  I saw it later than most and, despite reviews I heard, I thoroughly enjoyed it.

I love when humans try to control the uncontrollable and pay the price.  If you're going to build a dinosaur park... I don't care how much effort you put into trying to control the dinosaurs... you are going to lose.  There's something very gratifying for me watching these wannabe puppeteers suffer the T-Rex bite they genetically engineered, bred fierce, and sought to tame.

Software delivery is a vicious dinosaur.  You start with an idea.  So innocent, harmless, but requiring so much care and nurturing.  You grow it gradually.  It begins to require more and different kinds of care to keep progressing.  As your little dino grows you realize that he's getting unruly and you need to build some safeguards into your system and delivery process.  Maybe you need to add some tests, some automation.  Before you're done with any of that... BAM a major bug hits.  Your dino's fully formed teeth are capable of biting through the steel cable you engineered to keep him in.  We'll introduce a stronger, higher gauge cable, AND let's electrify it.

Lather. Rinse. Repeat.

We all face significant challenges that we need to solve on behalf of our customers.  There will be an endless string of problems that we run into along the way.  We'll solve some of those by leveraging any combination of libraries, frameworks, and solutions - some well-known, some well-understood, some cutting edge, and some not well-understood.

You have options.  More well-known technologies likely have a larger support base.  They probably also have more and better documentation, and a community that is seeking and solving its problems. On the flip-side, technologies that are more well-known and understood may not be solving the latest problems, and maybe not in the most effective way.

Cutting edge technologies are likely solving or streamlining more problems, or more significant problems.  They're a jump forward of sorts.  Maybe you can solve the problem in far fewer lines of code.  Maybe concurrency is simplified immensely. Maybe deployment and scalability become low-hanging fruit.  Being a newer technology though, it is likely not as well-understood, possibly not as well-documented.  Certainly the adoption level is lower, which means that community support is going to be lower.

The former might result in a more manageable, tamable stegosaurus.  It's less rapid than other approaches, but consistent.  Your stegosaurus is going to eat, sleep, and shit in a fairly predictable manner.  It's not going to bite you, and any fires it may start will be manageable.

The latter might result in an unmanageable, unpredictable T-Rex.  It's fast, vicious, and will bite your head off as soon as it gets the opportunity.

Remember when Dr. Grant and the kids were running through the field as a flock of dinos ran past them?


These guys seem reasonable.  Extremes are rarely the right answer.  Maybe a Gallimimus software delivery pipeline is a proper middle ground?

Whether you realize it or not, you own the characteristics of your delivery pipeline.  The series of choices you made since the inception of your idea created your pipeline.  It's not done though.  You are constantly adjusting and molding it.

Investing in change... in new and valuable technology is important.  After all, who doesn't want to avoid solving solved problems, and leap forward?  But, we must treat it as an investment.  If it's the right business decision do it, but do it intelligently.  Don't take on a conversion or technological change expecting stegosaurus-like outcomes.  You might have a raptor on your hands.  You have no business taking on a change like this without an investment in understanding the ways that it can bite you and accounting for those.  If you think you're going to convert from a COBOL stack to a Java stack, a .NET stack to a Scala stack, an on-prem stack to a cloud stack, or any other kind of major conversion, expect unpredictability.  Likewise, don't breed a T-Rex / velociraptor hybrid without expecting some casualties.

As you move forward make intentional decisions about the state of your pipeline.  Consider:

  • Your current state
  • The relative newness of the thing you're evaluating.  Is it well-known and understood?
  • The learning level and talent of your engineering organization
  • The trust level your management team has for your engineering organization

Maybe the right thing for your organization is to engineer your own dinosaur.  But mayyyybe not an Indominus Rex.  Maybe a stegoraptor though?




Tuesday, July 21, 2015

Thoughts on DevOps

DevOps, like most paradigm shifting buzz words, has become an overloaded, muddled term.  I've been thinking about this a lot lately and here are some uncategorized thoughts on this rapidly evolving area of software delivery.

(Most of these statements should probably start with.. "regardless of where you are today")

  • Focus on customers.  The culture and organizational change associated with DevOps should make everyone involved in the delivery of a solution (including infrastructure roles) more aligned and accountable to the customer.
  • I view this primarily as a movement to apply engineering practices to infrastructure and operations management.  Versioning, automation, testing.
  • DevOps is about dependency removal, in much the same way that agile is.  There's no better way to remove dependencies than to add the function of that dependency to teams requiring it. Add people with the skills, or grow the skill set within the team.
  • I believe one maximizes agility by removing all dependencies and allowing a team to create, manage, and run its entire stack.  
  • Teams running their entire stack leaves the potential for similar, possibly duplicate, effort across teams.  While duplication is evil, it's the tradeoff you make for agility.
  • A centralized "DevOps" team is completely reasonable in my view, but it should not be how an organization starts.  The formation of a central DevOps team should be a conscious decision to follow the DRY principle - to remove duplication that has emerged organically.  As well, the team must not become a bottleneck as teams evolve / change. 
  • Ownership boundaries are clearer with fewer dependencies.  If a dev team owns the app code and ops owns the infrastructure, who addresses an unclear problem near the boundary?
Still thinking...

Wednesday, February 25, 2015

The Feature Toggle Antipattern

The move toward a more agile software delivery model requires the adoption of improved technical practices.  One of the first is generally the concept of, and tooling associated with, continuous integration (CI).  The adoption of CI practices yields other challenges, one of which is partially complete features.  Many features take much longer to complete than a best-practice integration cycle.

Feature toggles are a very useful way to solve this problem.  By employing this concept you can effectively decouple commits and code integration from the release of a feature in that code.  This is very powerful.  We can now get the benefits of continuous integration without the obvious issue of exposing a partially completed feature.
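In its simplest form a feature toggle is just a conditional around an unreleased code path.  A minimal sketch (real systems usually back the flag store with config files or a toggle service rather than an in-memory dict; the flag and flow names are invented):

```python
class FeatureToggles:
    """Minimal in-memory toggle store."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name):
        # Unknown flags default to off, so half-built features stay hidden.
        return self._flags.get(name, False)

def checkout(toggles):
    """The deployed code contains both paths; the toggle picks one."""
    if toggles.is_enabled("new-checkout"):
        return "new checkout flow"
    return "legacy checkout flow"

# Code for the new flow is integrated and deployed, but customers see legacy.
toggles = FeatureToggles({"new-checkout": False})
```

Flipping the flag - not deploying code - is what releases the feature, which is exactly the decoupling described above.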

Like with all things we can take this concept too far.  Martin Fowler advocates for avoiding feature toggles to hide things in production:
Your first choice should be to break the feature down so you can safely introduce parts of the feature into the product. The advantages of doing this are the same ones as any strategy based on small, frequent releases. You reduce the risk of things going wrong and you get valuable feedback on how users actually use the feature that will improve the enhancements you make later.
He simply suggests embracing your agility.  The need for these toggles means you're already releasing to production more frequently than you can complete features.  Why not break the work down further, and learn from each release?  Here Martin is suggesting avoiding features that will take longer than your release cycle.  Instead, break them down.  But a frequent production release cycle is good.  Don't attempt to solve this by releasing less frequently...

Not everyone may be able to accomplish this easily though.  It's a worthy goal to improve to over time.  However, teams need to monitor for over reliance on these toggles.  So a couple things to watch out for:
  1. Completed, but not released, features.  If you've completed a feature you should be ready to release it.  Otherwise, what else could you have worked on that would be releasable - and adding customer value - today?
  2. The number of hidden features.  If #1 is a problem, this is likely also a problem.  However, this can also manifest if you have too much work in progress (WIP).  Reducing WIP can drive feature completion, therefore releasability, and therefore customer value.
If taken to an extreme the number and scope of features that are hidden in production can reach cumbersome levels.  I call this the Feature Toggle Antipattern.  In its worst form agile teams lose sight of their stockpile of not-yet-released features, even releasing (i.e. no longer hiding) features less frequently than in a typical waterfall project.

In a waterfall project there is a clear beginning and end, often to a fault.  That's one of the things that agile overcomes very successfully.  Aligning teams around a product and driving features through that team eliminates the on / off nature of waterfall projects - the imminent big bang.

With frequent releases (no big bang) it's easy to lose sight of the customer value that's hiding in your toggles.  When it's time to actually release those features you could end up with a big (detoggling) bang that dwarfs its waterfall alternative.  There will be other losses too, beyond the opportunity cost of unreleased customer value.  As you encounter issues, or get feedback from customers, you'll need to change and adapt.  Many of those features were developed a long time ago, so you pay all the cost of context switching and refamiliarization.

If you're going to use feature toggles to deliver your product make sure you avoid this antipattern.  You'll avoid many of the heartaches that drove you to embrace agile in the first place.

Saturday, January 31, 2015

The Cloud Decision

For some there may not even be a debate when it comes to the cloud.  The flexibility and scalability it offers small companies - those that don't want to build their own datacenter and aren't sure about the size of their customer base (their viral coefficient, even) - is invaluable.  For larger companies the evaluation is generally more difficult.  The primary driver is usually cost, and that savings must be measured against all the other change that's necessary - in security, architecture, and governance, to name just a few.  Any good startup (or smaller company) has a strong culture of innovation.  After all, that's how startups start.  Innovation is a critical element of the cloud decision making process that too easily gets lost in the evaluation for larger companies.  This, and other value generating endeavors, are amplified by a service enabled infrastructure - particularly one where self-service is encouraged.

Looking at the cloud decision from the CIO level is just too high.  Doing that is going to dismiss the most significant benefits.  I imagine the typical evaluation goes something like this:

  • Cost.  In the cloud we can pay for what we use, and not worry about underutilized assets. Ok, that's a +.
  • We'll really need to ramp up security.
  • Let's do it!

It may be true that an organization will lower costs by doing just this.  However, there is an enormous amount of lost opportunity in making this move so naively.

If you have an on-premise datacenter you probably have dedicated infrastructure teams.  You've probably also built up processes for development teams to interact with those infrastructure teams. Focusing on cost and ignoring the self-service, service-enabled nature of cloud providers might cause you to reimplement your existing datacenter, architecture, and processes in the cloud, avoiding the majority of the benefits.

Let's take a simple example that I recently heard to illustrate a company that recognized the value of the self-service model and reaped the benefits.  A developer at a larger company supported an existing, cumbersome process for making regularly released files available to external parties.  Once her releasable artifact was built, she sent it to another team and notified them of its availability.  At that point the other team would "approve" its release and put it on an external-facing FTP site.  The process took several days on average.

In their cloud migration / implementation this developer was empowered to use the available services.  She recognized that the cloud storage solution now available could easily replace the FTP site and the to-be-released artifact could be automatically sent to the storage solution directly from the build process. The net result was an automated, reliable process with immediate results rather than a multi-day lead time.
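The automated version can be tiny.  A hypothetical sketch of the build step's publishing logic - the bucket name, key layout, and paths are all invented, and the actual upload call is left as a comment since it depends on credentials and the storage SDK:

```python
import os

def release_key(artifact_path, version):
    """Compute the object key the build publishes, e.g. releases/2.4.1/report.zip."""
    return "releases/%s/%s" % (version, os.path.basename(artifact_path))

key = release_key("build/out/report.zip", "2.4.1")
# The build's final step would then push the artifact directly, e.g. with boto3:
#   boto3.client("s3").upload_file("build/out/report.zip", "external-releases", key)
# External parties read from the bucket; the multi-day FTP hand-off disappears.
```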

There are two reasons this succeeded.

  1. The developer was intimately familiar with the process.  Enough so such that she could recognize and implement the improvement.
  2. There was a conscious decision to empower her; to allow her access to cloud services.  Her organization could have easily restricted access to the storage solution such that only the FTP team (or other infrastructure team) had access.
An enormous benefit of moving to the cloud is its service-enabled nature.  Organizations with manual processes and hand-offs have all the opportunity in the world to take advantage of this.  There is a key though.  The true benefit only occurs when there is a reduction in dependencies.  Reduction. In. Dependencies. This must mean that requests are not necessary; that teams can "request" infrastructure via a console or API, on-demand, and not depend on an external entity.  

The on-premise datacenter versus cloud provider decision is a difficult one.  It's one that I do not think should be made lightly, and one that should not be made for cost reasons alone.  Organizations need to make sure they recognize the real benefits and take advantage of them.  This can be a very large change.  In many cases it's a culture change, an architecture change, and a governance change.  Think through what it means to reduce dependencies.  Roles may change, and skill sets may be challenged.  This requires great leadership, trust, and maturity to accomplish successfully.  I'll end with a sobering statistic from VMware:
63 percent of Amazon AWS projects are considered failed, compared to 57 percent of projects on Rackspace and 44 percent of Microsoft Azure projects.

Saturday, January 10, 2015

Single Responsibility Principle 2.0

I see this concept coming back a lot as of late, at least in my ongoing learning and discovery of good design and architecture.  The Single Responsibility Principle is one of the SOLID principles for good object oriented design and development. I think the SRP, and likely many other principles, have become applicable at higher levels of abstraction.

Some amazing advances in approaches to infrastructure and configuration management, deployment, and scaling have hit our industry hard in recent years.  Historically, before virtual machine technology was really the norm, there was still a need and desire to consolidate applications and services to run on a minimal physical footprint.  After all, you want to use the hardware that you've purchased effectively.  You don't want to run a beefy server at 10% utilization all day, so you load it up to more fully utilize it.  This drove a lot of coupling at the infrastructure level of our architectures.

When VM technology became heavily adopted this started to become less of an issue.  You could use the same physical hardware, but host many logical VMs on it.  Thus, you've separated more of the concerns, thereby further decoupling co-hosted applications.  This improves architecture, but trades for increased complexity and load on infrastructure teams.  If we're going to build more single-purpose servers then we'll need more VMs.  This spawned the need for greater automation in the infrastructure space.  Along come tools like Puppet, Chef, SaltStack, and Ansible.  These tools have done an amazing job fulfilling exactly this need.  Write your infrastructure as code, version it, and leverage it for infrastructure, on demand.

We now have tools that enable us to rethink our approach to design and architecture in support of the single responsibility principle.  In the early 2000s when Uncle Bob wrote about the SRP he raised visibility to a concern that we should be asking ourselves as we design and change classes.  It's now 2015 and our tools have advanced incredibly.  We need to ask ourselves this same question as we design all layers of our systems.
How can we leverage the SRP to reduce coupling not just in our classes, but in our application / service design, and in our infrastructure design?  
Modern tools enable, and frankly necessitate, that we ask ourselves this question across our entire stack.  It's really this thinking that drives teams and organizations to microservice architectures.  A class that is designed with a single responsibility changes for one reason and one reason alone.  This keeps classes small, thus easier to change.  Services should be easy to change as well.  What better way than to enable that than to give each its own, single purpose and therefore small, changeable implementation.

I have been particularly interested in Docker lately.  I believe it adds a tremendous amount of value in this space.  Despite great tools like Puppet, Chef, etc., their approach is still bulkier.  With Docker's lightweight containers the creation of individual service or application images is fast, and spawning them is even faster.  You can start a Docker container just about as fast as your service itself can start.  Docker also really shines in its simplistic mechanism for linking containers together for interactions.  If you have a typical web server, app server, database architecture it does not take long to get those pieces Dockerized and running together, linked, in a Docker environment.  I really like what Fig did to simplify this even further.

Many of us likely have some work to do to improve our applications and systems in consideration of the SRP.  There won't be any shortage of work there.  The good news is that there is little holding us back when it comes to available tooling.  It's there and most of it is free.  Despite the challenges it seems to me that it's worth giving some thought and moving in that direction.  More easily changed services will yield business agility, and therefore customer satisfaction.

Tuesday, October 28, 2014

Agility, Technical Leadership and the so-called Talent Shortage

My exposure to the agile community has not been super broad, but from what I have seen people tend to talk about agile in two forms.  Or at least I perceive two distinct discussions.  There is the "process" side of agile.  This includes things like planning, communication, team norms, managing deliverables, prioritization, etc.  Then there's what I call the technical practices side of agile.  This includes practices like continuous integration, test first development, automation, pairing, etc. - mostly those practices borne out of extreme programming (XP).

I'm a whole-hearted believer in these technical practices.  I believe it's these practices that form the foundation for agility.  You can do XP without the agile processes, but you can't be agile without XP.  Even doing XP in a pure waterfall world would yield huge productivity gains.  For that reason I mostly equate one's agility with one's strength in technical practices.  That's not to say there aren't significant gains to be had with process change.  You just can't get functionality into the hands of customers faster than you can build, test, and release it.  And the speed of your build, test, and release cycle is a function of how much of it you've automated.

A few years ago Andy Singleton posted Tech Leads Will Rule the World.  I've had it bookmarked ever since.  I liked what it had to say then.  I firmly believe it now.  Businesses are desperately fighting for agility as competition continues to increase and as software disrupts our world.  To achieve the kind of agility that is so critical now and for future business viability, technical practices could not be more important.

When it comes to adopting technical practices there is no one more important than tech leads.  These are the key influencers with the ability to set team norms, and most importantly, the ability to lead by example.  The only way to successful adoption is strong technical leadership.  The quickest way to peril is poor technical leadership.

But how do we all get there?  Especially with all this talk of the technical talent shortage.  Andrew Clay Shafer has taken a strong position on this topic.  He's well worth listening to.  Yes, we need to attract great talent.  First and foremost we need to look internally and treasure the strong technical leadership that we have today.  Your tech leads have a profound impact on the culture of the teams with which they work.

Do you know where your true tech leads are today?  If you're not sure look no further than your highest performing teams (they'll have the best technical practices).  And your "tech leads", well, you know how to identify them now too.