Amazon EMR, ALB & Me.
- Troy Dieter
- AWS
- November 15, 2020
Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. To secure traffic to the EMR web interfaces with Transport Layer Security (TLS), this post places an AWS Application Load Balancer (ALB) in front of the cluster and terminates HTTPS there.
My preferred way to deploy Amazon EMR (covered in another post), along with the Application Load Balancer and any dependencies, is with Terraform. The Terraform code below was tested and deployed using:
Terraform v0.12.29
+ provider.aws v3.15.0
+ provider.random v3.0.0
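To keep a deployment close to the versions listed above, the configuration could pin them. The block below is a minimal sketch and is not part of the original code; the version constraints are assumptions derived from the versions listed, and it uses the string form of required_providers that Terraform 0.12 accepts.

```hcl
# Sketch only: pins Terraform and the providers near the tested versions.
terraform {
  # Allows 0.12.29 and later 0.12.x patch releases
  required_version = "~> 0.12.29"

  required_providers {
    aws    = "~> 3.15"
    random = "~> 3.0"
  }
}
```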
I’ve deployed EMR with the following applications and ports:
- hadoop-hdfs-namenode - 50070
- hadoop-hdfs-datanode - 50075
- hbase - 16010
- hue - 8888
- jupyterhub - 9443
- livy - 8998
- spark - 18080
- tez - 8080
- yarn-node-manager - 8042
- yarn-resource-manager - 8088
- zeppelin - 8890
Prerequisites
- Terraform v0.12.29
- An AWS account in good standing
- An AWS Certificate Manager (ACM) certificate in the Issued state that covers the DNS namespace (see line 6 of the code below)
Manifest
- AWS Application Load Balancer
- AWS security group for ingress traffic for AWS Application Load Balancer
- AWS Application Load Balancer listener (HTTPS)
- AWS Application Load Balancer listener rules (see above applications created)
- AWS Application Load Balancer target groups (see above applications created)
Terraform AWS Provider
- AWS Application Load Balancer
- AWS security group for ingress traffic for AWS Application Load Balancer
- AWS Application Load Balancer listener (HTTPS)
- AWS Application Load Balancer listener rules
- AWS Application Load Balancer target groups
Variables
Perform the following to correctly set the variables.
- Retrieve the code below and adapt it however you see fit.
- Set the domain of your existing AWS ACM certificate in line #7, for example domain = "*.troydieter.com"
- Set the top level domain in line #13, for example default = "emr.troydieter.com"
- Set the aws-profile variable (line #50) to the credentials profile to deploy with. The code below declares no default for it, so Terraform prompts for a value if none is supplied.
- Set the aws_region variable (line #55) if your region is not us-east-1
- Set the environment variable (line #61) accordingly
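Instead of editing defaults in place, the same values could be supplied through a terraform.tfvars file, which Terraform loads automatically from the working directory. The sketch below assumes the variable names from the code in this post; every value is a placeholder. Note that vpc_id and aws-profile have no default in the code, so they need a value one way or another.

```hcl
# terraform.tfvars (sketch; all values are placeholders)
aws-profile = "my-profile"            # hypothetical credentials profile name
aws_region  = "us-east-1"
environment = "dev"
domain      = "emr.example.com"
vpc_id      = "vpc-0123456789abcdef0" # placeholder VPC ID
cidr_block  = "10.0.0.0/8"            # example: restrict ingress to an internal range
```

Narrowing cidr_block from its 0.0.0.0/0 default is worth considering, since the EMR web interfaces would otherwise be reachable from anywhere on port 443.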
Apply
- Use terraform plan to ensure the variables set are correct
- Use terraform apply to apply
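After an apply, it can be handy to have Terraform print where things ended up. The outputs below are a sketch that could be added to the same configuration; the output names are hypothetical, while the resource and variable names are the ones used in the code in this post.

```hcl
# Sketch only: output names are illustrative, not part of the original code.
output "alb_dns_name" {
  description = "DNS name of the load balancer; each application hostname should resolve to it"
  value       = aws_lb.emr_lb.dns_name
}

output "emr_app_urls" {
  description = "HTTPS URL for each EMR application, matching the host-based listener rules"
  value       = { for app, port in var.emr-app : app => "https://${app}.${var.domain}" }
}
```

The listener rules match on the host header, so each of these hostnames still needs a DNS record pointing at the load balancer before the URLs resolve.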
Terraform Code
# EMR Load Balancer Creation
# www.troydieter.com

# Certificate and domain

data "aws_acm_certificate" "wildcard-cert" {
  domain   = "*.example.com"
  statuses = ["ISSUED"]
}

variable "domain" {
  type        = string
  default     = "emr.example.com"
  description = "The top level domain used for EMR"
}

resource "random_id" "lb-rand" {
  byte_length = 2
}

provider "aws" {
  profile = var.aws-profile
  region  = var.aws_region
}

# Tags

locals {
  emr-tags = {
    "parent_app"  = var.application
    "environment" = var.environment
  }
}

# Data sources
# Used for the default target group, send traffic to the NameNode

data "aws_lb_target_group" "emr-namenode" {
  name       = "hadoop-hdfs-namenode-${random_id.lb-rand.hex}"
  depends_on = [aws_lb_target_group.emr-tg]
}

# Variables

variable "application" {
  type    = string
  default = "EMR"
}

variable "aws-profile" {
  type        = string
  description = "AWS Profile used to deploy with"
}

variable "aws_region" {
  type        = string
  default     = "us-east-1"
  description = "Region"
}

variable "environment" {
  type        = string
  default     = "dev"
  description = "Environment you're deploying with"
}

variable "vpc_id" {
  type        = string
  description = "The VPC ID that the load balancer deploys to"
}

variable "cidr_block" {
  type        = string
  default     = "0.0.0.0/0"
  description = "CIDR Block of allowed ingress traffic"
}

variable "elbsecpolicy" {
  type        = string
  default     = "ELBSecurityPolicy-TLS-1-1-2017-01"
  description = "Applied AWS ELB policy"
}

# Example list (map) of AWS EMR applications used

variable "emr-app" {
  type = map(string)
  default = {
    hadoop-hdfs-namenode  = "50070"
    hadoop-hdfs-datanode  = "50075"
    hbase                 = "16010"
    hue                   = "8888"
    jupyterhub            = "9443"
    livy                  = "8998"
    spark                 = "18080"
    tez                   = "8080"
    yarn-node-manager     = "8042"
    yarn-resource-manager = "8088"
    zeppelin              = "8890"
  }
}
# Import subnets
# Note: this selects every subnet in the VPC

data "aws_subnet_ids" "alb-subnets" {
  vpc_id = var.vpc_id
}

# AWS Security Group
# Allows HTTPS in from var.cidr_block and all traffic out

resource "aws_security_group" "lb_sg01" {
  name        = "${var.application}-${lower(var.environment)}-lb-sg01"
  description = "Allow inbound traffic to the ${upper(var.application)} load balancer"
  vpc_id      = var.vpc_id

  ingress {
    description = "LB"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.cidr_block]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  lifecycle {
    create_before_destroy = true
  }
}
# EMR Load Balancer

resource "aws_lb" "emr_lb" {
  name               = "${lower(var.application)}-${lower(var.environment)}-lb-${random_id.lb-rand.hex}"
  load_balancer_type = "application"
  subnets            = data.aws_subnet_ids.alb-subnets.ids
  security_groups    = [aws_security_group.lb_sg01.id]

  lifecycle {
    ignore_changes = [
      tags,
      access_logs
    ]
  }

  depends_on = [aws_lb_target_group.emr-tg]
  tags       = local.emr-tags
}

# HTTPS listener: terminates TLS with the ACM certificate and, when no
# host-based rule matches, forwards to the NameNode target group

resource "aws_lb_listener" "emr-443" {
  load_balancer_arn = aws_lb.emr_lb.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = var.elbsecpolicy
  certificate_arn   = data.aws_acm_certificate.wildcard-cert.arn

  default_action {
    type             = "forward"
    target_group_arn = data.aws_lb_target_group.emr-namenode.arn
  }

  depends_on = [aws_lb_target_group.emr-tg]
}
# One listener rule per application: requests for <app>.<domain> are
# forwarded to the target group of the same name

resource "aws_lb_listener_rule" "host_based_emr_routing" {
  for_each     = var.emr-app
  listener_arn = aws_lb_listener.emr-443.arn

  action {
    type = "forward"
    # Both resources iterate over var.emr-app, so the key always exists
    target_group_arn = aws_lb_target_group.emr-tg[each.key].arn
  }

  condition {
    host_header {
      values = ["${each.key}.${var.domain}"]
    }
  }
}

# One target group per application, listening on that application's port

resource "aws_lb_target_group" "emr-tg" {
  for_each    = var.emr-app
  name        = "${each.key}-${random_id.lb-rand.hex}"
  port        = each.value
  target_type = "instance"
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  tags        = local.emr-tags
}
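The code above creates the target groups but does not register any targets in them. One way to close that gap is sketched below, under the assumption that the EMR master node's EC2 instance ID is available; the variable emr_master_instance_id and the resource name emr-master are hypothetical and not part of the original code.

```hcl
# Sketch only: registers a single instance in every target group.
variable "emr_master_instance_id" {
  type        = string
  description = "Hypothetical: EC2 instance ID of the EMR master node"
}

resource "aws_lb_target_group_attachment" "emr-master" {
  # Iterate over the target groups created above, one attachment per application
  for_each         = aws_lb_target_group.emr-tg
  target_group_arn = each.value.arn
  target_id        = var.emr_master_instance_id
  # port is omitted, so each attachment uses its target group's port
}
```

This treats the master node as the target for every application, which may not hold for all of them: the hadoop-hdfs-datanode and yarn-node-manager interfaces typically run on core nodes, so those target groups would likely need different instance IDs.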


