Switching to NAT Instances to Save Costs
— Startups, DevOps, AWS — 5 min read
High costs of NAT Gateways
According to the AWS VPC pricing page, as of today a NAT Gateway costs $0.045 per hour plus $0.045 per GB processed. That works out to around $32.40 per month for each NAT Gateway (24 hours × 30 days), plus the data processing costs. This can add up quickly, especially for startups and small businesses with limited budgets. The low fixed cost and the lack of operational overhead make NAT Gateways look like a good choice, but once the data processing charges start burning a hole in the budget, they become a concern. The best alternative is to switch to NAT Instances.
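To make the arithmetic concrete, here is a minimal Python sketch of the monthly bill. It assumes the two prices quoted above and a 30-day month; the function name and the 1,000 GB example volume are illustrative assumptions, not figures from an actual bill.

```python
HOURLY_RATE_USD = 0.045     # per NAT Gateway hour, as quoted above
PER_GB_RATE_USD = 0.045     # per GB processed, as quoted above
HOURS_PER_MONTH = 24 * 30   # assumption: a 30-day month


def nat_gateway_monthly_cost(gb_processed: float, gateways: int = 1) -> float:
    """Estimate the monthly NAT Gateway bill in USD."""
    fixed = HOURLY_RATE_USD * HOURS_PER_MONTH * gateways
    variable = PER_GB_RATE_USD * gb_processed
    return round(fixed + variable, 2)


if __name__ == "__main__":
    # Fixed cost only, with no traffic at all.
    print(nat_gateway_monthly_cost(0))
    # Hypothetical workload pushing 1,000 GB through one gateway.
    print(nat_gateway_monthly_cost(1000))
```

The fixed part stays flat per gateway, while the per-GB part grows with traffic, which is why data-heavy workloads feel the cost first.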
Decisions involved in making the switch
- If you have access to CloudWatch Metrics, look at the NAT Gateway metrics to see how much data flows through the resource. The important metrics are:
- EC2 NatGateway: PeakBytesPerSecond: Maximum
- EC2 NatGateway: BytesInFromDestination: Sum
- EC2 NatGateway: BytesInFromSource: Sum
- EC2 NatGateway: BytesOutToDestination: Sum
- EC2 NatGateway: BytesOutToSource: Sum
- EC2 NatGateway: ActiveConnectionCount: Maximum
- If you have access to Cost Explorer, you can analyse the cost of the NAT Gateway alongside the total data processing charges. To do this, select NatGateway-Bytes and NatGateway-Hours in the "Usage Type" dropdown.
- With these two reports in hand, you can cross-check the data and fill in the gaps where one AWS system is inconsistent with the other, which lets you make a more informed decision about switching to NAT Instances. The most important number that should come out of this is the maximum throughput required at any given moment.
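Turning the byte counts into a throughput figure is a small unit conversion. The sketch below is one way to do it, assuming you have exported the Sum datapoints of a byte metric as a plain list and that each datapoint covers a 60-second period; both the function names and the sample values are hypothetical.

```python
from typing import Iterable


def bytes_per_second_to_gbps(bytes_per_second: float) -> float:
    """Convert bytes per second to gigabits per second (1 byte = 8 bits)."""
    return bytes_per_second * 8 / 1e9


def peak_gbps_from_sums(byte_sums: Iterable[float], period_seconds: int = 60) -> float:
    """Return the highest average throughput seen in any single period.

    Each entry in byte_sums is the total bytes in one period, so the result
    is an average over that period and can understate short bursts.
    """
    rates = [bytes_per_second_to_gbps(total / period_seconds) for total in byte_sums]
    return max(rates) if rates else 0.0


if __name__ == "__main__":
    # Hypothetical one-minute Sum datapoints, in bytes.
    sample = [7.5e9, 15e9, 3e9]
    print(peak_gbps_from_sums(sample))
```

Because a Sum averages traffic over the whole period, it is worth comparing the result against the PeakBytesPerSecond maximum before settling on a number.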
Example
In my case, I found that the max throughput required was around 3.25 GBps, or roughly 26 Gbps (1 byte = 8 bits).
Unless your application is very critical, it will still work with less throughput than the peak calls for. It will just become a bit slower when load is high.
Once you've calculated the throughput requirement, you can proceed to the next section.
Choosing to go NAT Instance way
Simplifying the decision
- If the application is critical and cannot tolerate any downtime, then stay with NAT Gateway
- If the bandwidth requirement is <= 5 Gbps, then switch to NAT Instances
- If the bandwidth requirement is > 5 Gbps, then...
  - If your throughput requirement is known and very high (around 50 Gbps or more), switch to NAT Instances
  - If your data processing cost is very high (more ingress than egress), switch to NAT Instances
  - None of the above? Stay with NAT Gateway
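The rules above can be written down as a small function. This is only a sketch of the checklist, not a sizing tool: the function name and parameters are hypothetical, and treating "around 50 Gbps or more" as a hard `>= 50` cut-off is my simplifying assumption.

```python
def recommend_nat(
    critical: bool,
    required_gbps: float,
    data_processing_cost_high: bool = False,
) -> str:
    """Apply the decision checklist in order and return the suggested option."""
    if critical:
        # Cannot tolerate downtime: stay on the managed service.
        return "NAT Gateway"
    if required_gbps <= 5:
        return "NAT Instance"
    # Above 5 Gbps, switch only for very high throughput or very high
    # data processing charges.
    if required_gbps >= 50 or data_processing_cost_high:
        return "NAT Instance"
    return "NAT Gateway"


if __name__ == "__main__":
    print(recommend_nat(critical=False, required_gbps=2))
    print(recommend_nat(critical=False, required_gbps=20))
    print(recommend_nat(critical=True, required_gbps=2))
```

The order of the checks matters: criticality is tested first, so a critical workload stays on NAT Gateway regardless of its bandwidth.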
Choosing the right instance type
- The smallest instance type is t4g.nano, which gives you a sustained throughput of 32 Mbps, burstable up to 5 Gbps. This is good for development or other non-production environments. It will cost you around $3/month.
- For sustained 1 Gbps throughput, you can use the c6gn.medium instance type. The gn line of instances is meant for networking. This will cost you around $50/month.
- For 5 Gbps throughput, you can use the c7gn.large instance type. This will cost you around $132/month.
- If you want to go higher, then AWS has known increments for instance types - 6 Gbps, 12 Gbps, 25 Gbps, 50 Gbps and 100 Gbps. You can choose the instance type accordingly.
Note - All instances below 32 vCPU can only provide sustained throughput of at most 5 Gbps. (Docs). This means that once your bandwidth requirement exceeds 5 Gbps, the price goes up significantly, since your instances will need at least 32 vCPUs, most of which will be sitting idle.
In my case, for development environments, I used t4g.nano instances. For production, I used c7gn.large instances.
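As a rough illustration of the selection, the sketch below picks the cheapest of the three instance types mentioned above for a given sustained throughput. The throughput and price figures are the approximate ones quoted in this post, not live AWS pricing, and the names `Option` and `cheapest_option` are hypothetical.

```python
from typing import NamedTuple, Optional


class Option(NamedTuple):
    instance_type: str
    sustained_gbps: float
    monthly_usd: float


# Approximate figures taken from the list above.
OPTIONS = [
    Option("t4g.nano", 0.032, 3),
    Option("c6gn.medium", 1.0, 50),
    Option("c7gn.large", 5.0, 132),
]


def cheapest_option(required_gbps: float) -> Optional[Option]:
    """Return the cheapest listed instance that sustains required_gbps, if any."""
    candidates = [o for o in OPTIONS if o.sustained_gbps >= required_gbps]
    return min(candidates, key=lambda o: o.monthly_usd) if candidates else None


if __name__ == "__main__":
    for need in (0.01, 0.5, 5, 10):
        choice = cheapest_option(need)
        print(need, choice.instance_type if choice else "needs a larger instance")
```

Anything above 5 Gbps returns `None` here because the table stops at c7gn.large; extending it means adding the larger instance types and their prices yourself.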
Implementation
I use the open-source fck-nat modules. They have proved reliable so far, including in high-availability mode, and their features are well worth it. If you're using IaC (Infrastructure as Code), their implementation details are in this link.
In high-availability mode, they use an EC2 Auto Scaling group with a minimum of 1 instance. When the instance goes down for any reason, a new instance is created automatically. The new instance runs a startup script that finds the ENI by matching its own name and then attaches the ENI to itself using the AWS CLI. This way, the route table does not need to be updated, and the downtime is minimal (at most 2 minutes in my testing).
Alternatively, you can create your own EC2 instance and use iptables to do the NAT. In my experience, the hassle is not worth it and fck-nat is a better choice.
My implementation
I used the fck-nat Terraform module. It is very simple and straightforward to use. Here is my implementation:
    # fck-nat.tf
    module "fck-nat-prod-us-east-1a" {
      source  = "RaJiska/fck-nat/aws"
      version = "~> 1.3"

      name      = "fck-nat-ue1a"
      vpc_id    = aws_vpc.fck-nat-vpc.id
      subnet_id = aws_subnet.fck-nat-subnet-ue1a.id
      providers = { aws = aws.useast1 }

      instance_type = "c7gn.large"

      update_route_tables = false
      ha_mode             = true
    }

    # rt-nat.tf
    resource "aws_route" "fck-nat-ue1a-route-nat" {
      provider               = aws.useast1
      route_table_id         = aws_route_table.fck-nat-route-table.id
      destination_cidr_block = "0.0.0.0/0"
      network_interface_id   = module.fck-nat-prod-us-east-1a.eni_id
    }

    # rt.tf
    resource "aws_route_table" "fck-nat-route-table" {
      vpc_id = aws_vpc.fck-nat-vpc.id
    }

    resource "aws_route_table_association" "fck-nat-rt-nat-association-ue1a" {
      subnet_id      = aws_subnet.fck-nat-subnet-ue1a.id
      route_table_id = aws_route_table.fck-nat-route-table.id
    }

To note:
- If you're using the fck-nat module, create one module per Availability Zone (AZ).
- Adding routes manually is better than letting the module do it. This way, you can control which subnets use the NAT Instance.
- Creating a separate route table for NAT instances is a good idea. While trying things out, if any issue comes up, you can avoid downtime by simply changing the route table association for the affected subnets. This makes reverting changes easy.
- It is always better to test the infrastructure change in a non-production environment first.
Closing
By switching from NAT Gateways to NAT Instances, my company was able to save around 40% of our NAT Gateway costs. We continue to use NAT Gateway for some mission-critical infrastructure, but since it sees less traffic, the overall savings have been significant.