Utilize random_shuffle to improve AWS availability zone spread when deploying with Terraform
In my repository, event-driven-msk (shown here), an Amazon VPC is deployed along with private and public subnets. Part of that requires selecting a region (defined in your provider.tf
file), along with selecting availability zones.
History
Prior to discovering random_shuffle, I used this:
```hcl
# Provisions the VPC for MSK
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "msk-vpc"
  cidr = "172.16.16.0/20"

  azs             = ["${var.aws_region}a", "${var.aws_region}b", "${var.aws_region}c"]
  private_subnets = ["172.16.16.0/25", "172.16.17.0/25", "172.16.18.0/25"]
  public_subnets  = ["172.16.16.128/25", "172.16.17.128/25", "172.16.18.128/25"]

  enable_nat_gateway = true
  enable_vpn_gateway = true

  tags = local.common-tags
}
```
As you can see, I am defining the azs argument in the module with an interpolation expression that appends a letter suffix to the region name. This isn't desirable: it is static rather than dynamic, always selecting the same three zones and assuming that zones a, b and c exist in the chosen region.
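For concreteness, with aws_region set to us-east-1, that expression is equivalent to hard-coding:

```hcl
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
```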
Improvement
Here comes random_shuffle (Terraform docs here):
```hcl
# Provisions the VPC for MSK
data "aws_availability_zones" "available" {
  state = "available"
}

resource "random_shuffle" "az" {
  input        = data.aws_availability_zones.available.names
  result_count = 3
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "msk-vpc"
  cidr = "172.16.16.0/20"

  azs             = [element(random_shuffle.az.result, 0), element(random_shuffle.az.result, 1), element(random_shuffle.az.result, 2)]
  private_subnets = ["172.16.16.0/25", "172.16.17.0/25", "172.16.18.0/25"]
  public_subnets  = ["172.16.16.128/25", "172.16.17.128/25", "172.16.18.128/25"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false
  enable_vpn_gateway     = true
  enable_ipv6            = true

  tags = local.common-tags

  public_subnet_tags = {
    connectivity = "public"
  }

  private_subnet_tags = {
    connectivity = "private"
  }
}
```
As you can see above, we use the element function with the random_shuffle resource's result attribute to select random values from the aws_availability_zones data source.
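As a slightly simpler variant (a sketch, assuming the same resources as above): since result_count = 3, the result attribute is already a list of three zone names, so it can be assigned directly without element:

```hcl
# result_count = 3 above, so the shuffled result is already a three-element list
azs = random_shuffle.az.result
```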
Gotchas
MSK (Managed Streaming for Apache Kafka) is not available in every availability zone. In my case, deploying to us-east-1, the zone us-east-1e does not support MSK:
╷
│ Error: error creating MSK Cluster (data-platform-dev-48fd): BadRequestException: One or more subnets belong to unsupported availability zones: [us-east-1e].
│ {
│ RespMetadata: {
│ StatusCode: 400,
│ RequestID: "56571475-52e0-44d6-abdd-3acaa4e7b1ca"
│ },
│ InvalidParameter: "brokerNodeGroupInfo",
│ Message_: "One or more subnets belong to unsupported availability zones: [us-east-1e]."
│ }
│
│ with aws_msk_cluster.data_platform,
│ on data_platform_msk.tf line 96, in resource "aws_msk_cluster" "data_platform":
│ 96: resource "aws_msk_cluster" "data_platform" {
│
╵
How do we work around this? The best way I've found is the exclude_names argument of the aws_availability_zones data source. (The related exclude_zone_ids argument expects zone IDs such as use1-az3, not zone names like us-east-1e, so it's not the right fit for a name-based exclusion.)

```hcl
data "aws_availability_zones" "available" {
  state         = "available"
  exclude_names = ["${var.aws_region}e"]
}
```
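If you prefer to exclude by zone ID instead, keep in mind that zone IDs identify physical zones and the name-to-ID mapping differs per AWS account. This is only a sketch, and use1-az3 is an illustrative placeholder, not the guaranteed ID for us-east-1e:

```hcl
# Hypothetical example: exclude a zone by its ID rather than its name.
# Zone IDs (e.g. use1-az3) identify physical zones; the name each ID maps to
# differs per AWS account, so look yours up first with:
#   aws ec2 describe-availability-zones --region us-east-1
data "aws_availability_zones" "available" {
  state            = "available"
  exclude_zone_ids = ["use1-az3"] # placeholder zone ID
}
```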
Result
Users who deploy this will now land in random availability zones within their AWS region, rather than always in zones a, b and c. Future improvements will include better handling of availability zones that do not support MSK.
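One possible direction for that improvement (a sketch, not the repository's current code, with a hypothetical variable name) is to drive the exclusion from a variable, so unsupported zones can be maintained per region without editing the data source:

```hcl
# Hypothetical variable; not part of the original repository.
variable "msk_unsupported_az_names" {
  description = "AZ names known not to support MSK in the target region"
  type        = list(string)
  default     = ["us-east-1e"]
}

data "aws_availability_zones" "available" {
  state         = "available"
  exclude_names = var.msk_unsupported_az_names
}
```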