Much of the security of your AWS implementation is foundational. Decisions that you make early on have the potential to impact the architecture of your system for a long time, often longer than you think. And because these factors are so foundational, changing them has more impact, often in the form of downtime or a more complex migration. It's important to carefully consider these up front and make your implementation as good as you can for your use case, but know that it will need to change at some point.
Automate Everything
This is an easy decision to make, but a hard one to honor. It's vitally important to drive your infrastructure definitions through version control for traceability and to provision them only from those versioned changes for reproducibility. Traceability becomes critical for audit and compliance. Reproducibility is the bedrock of delivering high-quality solutions: environments must be consistent if you want to test potential release candidates and improve the system through repeated testing cycles. Automation also has the positive side effect of being quicker and easier to run. In the cloud this means resources can be built and torn down at will. Treating your resources this way, as cattle rather than pets, can yield cost savings, scalability, improved security, and improved system testability.
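To make "provision only from versioned changes" concrete, here's a minimal sketch using boto3 (my choice for illustration; any infrastructure-as-code tool works the same way). The CIDR block and tags are placeholder values; the point is that the definition lives in a file under version control and the script, not the console, is what creates the resource.

```python
# provision_vpc.py - a minimal sketch of "infrastructure from versioned code".
# Assumes AWS credentials are configured; the CIDR and tags are example values.
import boto3

ec2 = boto3.client("ec2")

# This definition lives in version control alongside the script.
VPC_DEFINITION = {
    "CidrBlock": "10.20.0.0/16",
    "Tags": [{"Key": "Name", "Value": "example-env"},
             {"Key": "ManagedBy", "Value": "automation"}],
}

def provision():
    """Create the VPC exactly as defined above - no console tweaks."""
    vpc = ec2.create_vpc(CidrBlock=VPC_DEFINITION["CidrBlock"])
    vpc_id = vpc["Vpc"]["VpcId"]
    ec2.create_tags(Resources=[vpc_id], Tags=VPC_DEFINITION["Tags"])
    return vpc_id

if __name__ == "__main__":
    print(provision())
```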
VPC Network ACL
The VPC is the AWS resource that represents an isolated network. When deciding how to scope your VPC(s) it's important to consider your options for network-level security. You'll have a stateless network ACL at the VPC level, giving you the ability to allow and deny CIDR ranges for all subnets to which the network ACL is attached. Rules on network ACLs should be broad security definitions, and AWS publishes recommended ACL rules for common VPC scenarios that are worth reviewing. For instance, if you expect internet traffic into your VPC you may choose to allow port 443 globally. You will likely need to create rules for intra-VPC, inter-subnet traffic too, since the network ACL operates at the subnet level. That is, traffic leaving subnet-A destined for subnet-B will be filtered by the attached network ACL, even though subnet-A and subnet-B are in the same VPC. These rules can be broader definitions, possibly port ranges for microservices, but should be scoped to your VPC CIDR range. An important side note: because network ACLs are stateless, they are unaware of connections sourced within your VPC. You'll have to craft inbound rules that allow responses to traffic generated by your applications and services (and vice versa). Overall, when building network ACL rules, think broad, environment-wide rules.
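Here's a hedged sketch of what those broad, stateless rules might look like with boto3. The rule numbers, ports, and VPC CIDR are example values I've chosen for illustration; note the separate inbound rule for ephemeral ports, which is only needed because ACLs are stateless.

```python
# A sketch of broad, stateless network ACL rules with boto3.
# Rule numbers, ports, and CIDR blocks are example values.
import boto3

ec2 = boto3.client("ec2")

def add_broad_acl_rules(vpc_id, vpc_cidr="10.20.0.0/16"):
    acl_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]

    # Inbound HTTPS from anywhere (broad, environment-wide rule).
    ec2.create_network_acl_entry(
        NetworkAclId=acl_id, RuleNumber=100, Protocol="6",  # 6 = TCP
        RuleAction="allow", Egress=False, CidrBlock="0.0.0.0/0",
        PortRange={"From": 443, "To": 443})

    # Because ACLs are stateless, responses to connections our own servers
    # initiate come back on ephemeral ports and need an explicit inbound rule.
    ec2.create_network_acl_entry(
        NetworkAclId=acl_id, RuleNumber=110, Protocol="6",
        RuleAction="allow", Egress=False, CidrBlock="0.0.0.0/0",
        PortRange={"From": 1024, "To": 65535})

    # Intra-VPC, inter-subnet traffic, scoped to the VPC CIDR range.
    ec2.create_network_acl_entry(
        NetworkAclId=acl_id, RuleNumber=120, Protocol="-1",
        RuleAction="allow", Egress=False, CidrBlock=vpc_cidr)

    # Allow all outbound; tighten as your environment dictates.
    ec2.create_network_acl_entry(
        NetworkAclId=acl_id, RuleNumber=100, Protocol="-1",
        RuleAction="allow", Egress=True, CidrBlock="0.0.0.0/0")
    return acl_id
```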
Security Groups
The network filtering complement to the VPC network ACL is the security group. Think of security groups as single-purpose firewalls. They are stateful and operate most effectively at the resource level rather than at the overall network level. Here are some important notes on security groups.
- Stateful - no need to worry about rules for response traffic. Security groups are aware of active connections and allow the corresponding return traffic automatically.
- Can be attached to many resource types, including EC2 instances, load balancers, and RDS instances.
- Take advantage of the fact that they can be applied granularly. You get a win-win in the form of security and documentation if you apply rules at the most granular level (i.e. the service or application). You'll only be allowing what's absolutely necessary (security++), and you'll have documented(++) the fact that service X requires this/these port(s). If you want to take this a step further, you can even write a test suite that checks your security groups and asserts these granular rules (see the sketch after this list). The unfortunate reality is that it's extremely easy to rely on too few security groups and apply broad rules across multiple servers / systems. You end up with a confusing, coupled security group spiderweb that's extremely challenging to untangle. Check out aws-security-viz to see a visualization of your security groups. Very enlightening.
- Limits - there are limits on the number of rules per security group, the number of security groups per VPC, and the number of security groups a resource can have. These limits change over time, so start with the VPC service limits documentation for current values.
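As a sketch of the test-suite idea above, here's a hedged example using pytest and boto3. The group name and expected ports are hypothetical; substitute your own services.

```python
# A sketch of asserting granular security group rules with pytest + boto3.
# "orders-service" and its ports are hypothetical examples.
import boto3
import pytest

ec2 = boto3.client("ec2")

EXPECTED = {
    # security group name -> set of (from_port, to_port) it may expose
    "orders-service": {(8080, 8080)},
}

@pytest.mark.parametrize("group_name,allowed_ports", EXPECTED.items())
def test_security_group_only_exposes_expected_ports(group_name, allowed_ports):
    groups = ec2.describe_security_groups(
        Filters=[{"Name": "group-name", "Values": [group_name]}])["SecurityGroups"]
    assert groups, f"security group {group_name} not found"

    actual_ports = {
        (perm.get("FromPort"), perm.get("ToPort"))
        for perm in groups[0]["IpPermissions"]
    }
    assert actual_ports == allowed_ports
```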
Subnet Accessibility
Most companies design their network with at least an internal network and an externally accessible network, often called a DMZ. You can achieve the same effect in at least two different ways with a VPC.
Your first option is what I'll call the all-in-one subnet option. Create one subnet per AZ, as usual, and logically divide internal from internet-facing servers by allocating public IP addresses only to the internet-facing ones. Servers that get only a private IP address will not be reachable from the internet, achieving the typical internal-subnet effect. Because public and private servers are colocated in the same subnet, your network ACL(s) will have rules for both internet traffic and inter-subnet traffic. This isn't necessarily bad, just something to be aware of. And you'll of course have security groups wrapping your servers with more specific rules. With this approach you will need to decide whether your subnets default to assigning a public IP address or not. My preference is to avoid mistakes: default to private and require deployed servers to explicitly request a public IP. Now, the most significant downside to this approach is that you'll likely want at least some of your private servers to still reach the internet for something - hitting APIs, downloading packages, etc. Internet-accessible servers require an internet gateway, whereas private servers require a NAT to access the internet. The route table destination for both of these would normally be 0.0.0.0/0. So, unless you're willing to enumerate the destination CIDRs for the private servers, you'll have to go with the second option.
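To make the "default to private, opt in to public" preference concrete, here's a hedged boto3 sketch. The subnet ID, AMI ID, and instance type are placeholder values.

```python
# A sketch of "private by default, public IP only on request" with boto3.
# Subnet ID, AMI ID, and instance type are placeholder values.
import boto3

ec2 = boto3.client("ec2")

def make_subnet_private_by_default(subnet_id):
    # New instances in this subnet will NOT get a public IP unless they ask.
    ec2.modify_subnet_attribute(
        SubnetId=subnet_id, MapPublicIpOnLaunch={"Value": False})

def launch_internet_facing_server(subnet_id):
    # An internet-facing server explicitly opts in to a public IP at launch.
    return ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1, MaxCount=1,
        NetworkInterfaces=[{
            "DeviceIndex": 0,
            "SubnetId": subnet_id,
            "AssociatePublicIpAddress": True,
        }])
```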
The second option is what I'll call the DMZ subnet option. In this scenario you're not mixing internet-facing and internal servers in the same subnet; you'll have public and private subnets. Here are the basic steps (a sketch of the whole setup follows the list):
- Create two subnets per AZ. For the internal ones make them default to not assigning a public IP address. For the internet-facing ones (DMZ) make them default to assigning a public IP address.
- Your internal subnets will likely require larger CIDR blocks. The DMZ subnets should really only host your NAT Gateway and your internet-facing load balancers. Unless you have a ton of ELBs, these subnets can be smaller, although there's no downside (other than consuming available IPs) to making them large.
- Build two network ACLs. One will be attached to the DMZ subnets. For that one you should limit the traffic, ideally, to HTTPS (443) only. If you can't, grant the minimum. Remember, it's stateless, so you'll have to add rules for response traffic and traffic to / from your internal subnets. The second ACL will be attached to the internal subnets. This should be more restrictive, likely without any 0.0.0.0/0 rules.
- Create a NAT Gateway in a public (DMZ) subnet - one per AZ if you want to avoid cross-AZ dependencies. This is how your internal servers will still be able to reach out to the internet.
- Create an Internet Gateway.
- Create two route tables and associate each with its subnets. The internal route table should have a 0.0.0.0/0 route pointing to the NAT Gateway you just created. The DMZ route table should have 0.0.0.0/0 pointing to the Internet Gateway.
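Here's the promised sketch of the DMZ subnet option for a single AZ using boto3. The CIDR blocks and AZ are placeholder values; repeat per AZ as needed.

```python
# A sketch of the DMZ subnet option for a single AZ with boto3.
# CIDR blocks and the AZ are placeholder values.
import boto3

ec2 = boto3.client("ec2")

def build_dmz_layout(vpc_id, az="us-east-1a"):
    # Internal subnet: larger, private by default.
    internal_id = ec2.create_subnet(
        VpcId=vpc_id, CidrBlock="10.20.0.0/20",
        AvailabilityZone=az)["Subnet"]["SubnetId"]
    ec2.modify_subnet_attribute(
        SubnetId=internal_id, MapPublicIpOnLaunch={"Value": False})

    # DMZ subnet: small, assigns public IPs by default.
    dmz_id = ec2.create_subnet(
        VpcId=vpc_id, CidrBlock="10.20.16.0/24",
        AvailabilityZone=az)["Subnet"]["SubnetId"]
    ec2.modify_subnet_attribute(
        SubnetId=dmz_id, MapPublicIpOnLaunch={"Value": True})

    # Internet Gateway for the DMZ.
    igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    # NAT Gateway lives in the DMZ subnet so internal servers can reach out.
    eip = ec2.allocate_address(Domain="vpc")
    nat_id = ec2.create_nat_gateway(
        SubnetId=dmz_id, AllocationId=eip["AllocationId"])["NatGateway"]["NatGatewayId"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

    # DMZ route table: 0.0.0.0/0 -> Internet Gateway.
    dmz_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
    ec2.create_route(RouteTableId=dmz_rt,
                     DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
    ec2.associate_route_table(RouteTableId=dmz_rt, SubnetId=dmz_id)

    # Internal route table: 0.0.0.0/0 -> NAT Gateway.
    internal_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
    ec2.create_route(RouteTableId=internal_rt,
                     DestinationCidrBlock="0.0.0.0/0", NatGatewayId=nat_id)
    ec2.associate_route_table(RouteTableId=internal_rt, SubnetId=internal_id)
```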
An important caveat on security group limits
Most of these limits are in place to ensure that AWS can guarantee service levels. Obviously, the more rules that must be evaluated to make a network filtering decision, the more time and resources are required. Behind the scenes your security group rules evaluate to jump rules. If you specify a CIDR range in a rule, it evaluates to one jump rule: a direct IP comparison can be done on the traffic. If you instead reference another security group from which traffic should be allowed, this can result in a large number of jump rules. Essentially, the referenced security group is resolved to the resources it's attached to. If the referenced security group was attached to 10 other AWS resources, then 10 rules would be created behind the scenes, one per attached resource, so that the underlying firewall can do the proper IP comparisons. In short, referencing security groups is very convenient, but suboptimal when applied broadly.
I recommend specifying CIDR ranges wherever possible. It does, however, make perfect sense to reference security groups within an individual deployment stack. For instance, if you have a typical stack with a load balancer, an app server, and a database, you can give each its own security group. The load balancer might allow port 443 for HTTPS traffic and forward to port 8080 on the app server. The app server's security group can allow port 8080 but reference the load balancer's security group, effectively allowing traffic only from the load balancer. Similarly, the database might allow traffic on port 3306 but reference the app server's security group, allowing traffic only from the app server.
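Here's a hedged boto3 sketch of that load balancer -> app server -> database chain. The group names are examples I've made up; the ports follow the stack described above.

```python
# A sketch of the LB -> app server -> database security group chain.
# Group names are example values; ports match the stack described above.
import boto3

ec2 = boto3.client("ec2")

def create_stack_security_groups(vpc_id):
    def sg(name, desc):
        return ec2.create_security_group(
            GroupName=name, Description=desc, VpcId=vpc_id)["GroupId"]

    lb_sg = sg("example-lb", "internet-facing load balancer")
    app_sg = sg("example-app", "app server")
    db_sg = sg("example-db", "database")

    # Load balancer: HTTPS from anywhere.
    ec2.authorize_security_group_ingress(
        GroupId=lb_sg,
        IpPermissions=[{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
                        "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])

    # App server: 8080, but only from the load balancer's security group.
    ec2.authorize_security_group_ingress(
        GroupId=app_sg,
        IpPermissions=[{"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
                        "UserIdGroupPairs": [{"GroupId": lb_sg}]}])

    # Database: 3306, but only from the app server's security group.
    ec2.authorize_security_group_ingress(
        GroupId=db_sg,
        IpPermissions=[{"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
                        "UserIdGroupPairs": [{"GroupId": app_sg}]}])
    return lb_sg, app_sg, db_sg
```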
If you craft rules that result in a large number of jump rules, you will likely be restricted to a maximum of 100 security groups in your VPC. Rearchitecting out of this design can be a significant undertaking. I highly recommend that you instead design your AWS environment around smaller VPCs in which the deployed applications / services are expected to communicate with one another. This lets you rely mostly on CIDR ranges. If you instead deploy disparate applications to the same VPC and expect to filter their network traffic, you'll be forced to create logical "environments" within your VPC using security groups. This is where jump rules start to proliferate. At a minimum, plan to deploy separate VPCs for each of your environments - test, prod, etc. That way you can prevent test services from talking to production services with CIDR-based rules. Then you can deploy individual stacks with security group references like in the example above. You will have more security groups, but AWS will approve an increase beyond the initial 100 limit because your total number of jump rules will be very low.
What else?
There are a lot of considerations when you design your AWS environment. In upcoming posts I'll talk about:
- S3
- IAM - users, groups, roles, and policies
- Logging
Hopefully this helps as you think about your VPC design. What other considerations / implications have you come across that impacted your design choices? Did you discover anything after the fact that caused a significant redesign? I'd love to hear from you.