Amazon EMR, ALB & Me.

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. To serve the EMR web interfaces over Transport Layer Security (TLS), I place an AWS Application Load Balancer in front of the cluster and terminate HTTPS there.

My preferred way to deploy Amazon EMR (covered in another post), along with the Application Load Balancer and any dependencies, is with Terraform. The Terraform code below was tested and deployed using:

Terraform v0.12.29
+ provider.aws v3.15.0
+ provider.random v3.0.0
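
The versions above can be pinned so that a later run does not silently pick up a newer release. This is a minimal sketch using the Terraform 0.12 string form of required_providers; the constraint ranges are an assumption and are not part of the original configuration:

```hcl
# Sketch: pin Terraform and the providers to the releases listed above.
# The "~>" ranges are an assumption; tighten or loosen them as needed.
terraform {
  required_version = "~> 0.12.29"

  required_providers {
    aws    = "~> 3.15"
    random = "~> 3.0"
  }
}
```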

I’ve deployed EMR with the following applications and ports:

  • hadoop-hdfs-namenode - 50070
  • hadoop-hdfs-datanode - 50075
  • hbase - 16010
  • hue - 8888
  • jupyterhub - 9443
  • livy - 8998
  • spark - 18080
  • tez - 8080
  • yarn-node-manager - 8042
  • yarn-resource-manager - 8088
  • zeppelin - 8890

Prerequisites

  1. Terraform v0.12.29
  2. An AWS account in good standing
  3. An AWS Certificate Manager (ACM) issued certificate that covers the DNS namespace - see line 6

Manifest

  1. AWS Application Load Balancer
  2. AWS security group for ingress traffic for AWS Application Load Balancer
  3. AWS Application Load Balancer listener (HTTPS)
  4. AWS Application Load Balancer listener rules (one per application listed above)
  5. AWS Application Load Balancer target groups (one per application listed above)

Terraform AWS Provider

  1. AWS Application Load Balancer (aws_lb)
  2. AWS security group for ingress traffic for AWS Application Load Balancer (aws_security_group)
  3. AWS Application Load Balancer listener (HTTPS) (aws_lb_listener)
  4. AWS Application Load Balancer listener rules (aws_lb_listener_rule)
  5. AWS Application Load Balancer target groups (aws_lb_target_group)

Variables

Set the variables as follows. Line numbers refer to the Terraform code listing.

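  1. Retrieve the code below, or adapt it however you see fit.
  2. Set the domain of your existing ACM certificate in line #7, for example domain = "*.troydieter.com"
  3. Set the top-level domain for EMR in line #13, for example default = "emr.troydieter.com". Each application is served at <application>.<domain> (line #176), and a wildcard certificate matches a single label only, so make sure the certificate from step 2 covers these host names.
  4. Set the aws-profile variable (line #50) to the credentials profile to deploy with. It has no default, so Terraform prompts for a value if none is supplied.
  5. Set the aws_region variable in line #55 if you are not deploying to us-east-1
  6. Set the environment variable in line #61 accordingly
  7. Set the vpc_id variable (line #67). It has no default either.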
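
Instead of editing defaults in the listing, the same values can be supplied from a terraform.tfvars file in the same directory, which Terraform loads automatically. A sketch with placeholder values only; the profile name, VPC ID and CIDR range are hypothetical:

```hcl
# terraform.tfvars (sketch): every value below is a placeholder.
aws-profile = "default"               # named profile from your AWS credentials file
aws_region  = "us-east-1"
environment = "dev"
domain      = "emr.example.com"
vpc_id      = "vpc-0123456789abcdef0" # hypothetical VPC ID
cidr_block  = "10.0.0.0/8"            # hypothetical range, narrower than the 0.0.0.0/0 default
```

Restricting cidr_block is worth considering, since the default allows HTTPS ingress from anywhere.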

Apply

  1. Run terraform plan and confirm that the variable values and planned resources are correct
  2. Run terraform apply to create the resources
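
To see what was created once the apply finishes, a pair of outputs can be added alongside the code. This is a sketch: the output names are assumptions, while the resources and variables they reference come from the listing in this post:

```hcl
# Sketch: outputs printed after `terraform apply`.
# The output names are assumptions; the references come from the listing.
output "emr_lb_dns_name" {
  description = "DNS name of the load balancer; point the application host names at it"
  value       = aws_lb.emr_lb.dns_name
}

output "emr_app_urls" {
  description = "One URL per application, matching the host_header listener rules"
  value       = { for app, port in var.emr-app : app => "https://${app}.${var.domain}" }
}
```

The host names in emr_app_urls only resolve once DNS records for them point at the load balancer, which this configuration does not create.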

Terraform Code

  1# EMR Load Balancer Creation
  2# www.troydieter.com
  3
  4# Certificate and domain
  5
  6data "aws_acm_certificate" "wildcard-cert" {
  7  domain   = "*.example.com"
  8  statuses = ["ISSUED"]
  9}
 10
 11variable "domain" {
 12  type        = string
 13  default     = "emr.example.com"
 14  description = "The top level domain used for EMR"
 15}
 16
 17resource "random_id" "lb-rand" {
 18  byte_length = 2
 19}
 20
 21provider "aws" {
 22  profile = var.aws-profile
 23  region  = var.aws_region
 24}
 25
 26# Tags
 27
 28locals {
 29  emr-tags = {
 30    "parent_app"  = var.application
 31    "environment" = var.environment
 32  }
 33}
 34
 35# Data sources
 36# Used for the default target group, send traffic to the NameNode
 37
 38data "aws_lb_target_group" "emr-namenode" {
 39  name = "hadoop-hdfs-namenode-${random_id.lb-rand.hex}"
 40  depends_on = [ aws_lb_target_group.emr-tg ]
 41}
 42
 43# Variables 
 44
 45variable "application" {
 46  type    = string
 47  default = "EMR"
 48}
 49
 50variable "aws-profile" {
 51  type        = string
 52  description = "AWS Profile used to deploy with"
 53}
 54
 55variable "aws_region" {
 56  type        = string
 57  default     = "us-east-1"
 58  description = "Region"
 59}
 60
 61variable "environment" {
 62  type        = string
 63  default     = "dev"
 64  description = "Environment you're deploying with"
 65}
 66
 67variable "vpc_id" {
 68  type        = string
 69  description = "The VPC ID that the load balancer deploys to"
 70}
 71
 72variable "cidr_block" {
 73  type        = string
 74  default     = "0.0.0.0/0"
 75  description = "CIDR Block of allowed ingress traffic"
 76}
 77
 78variable "elbsecpolicy" {
 79  type        = string
 80  default     = "ELBSecurityPolicy-TLS-1-1-2017-01"
 81  description = "Applied AWS ELB policy"
 82}
 83
 84# Example list (map) of AWS EMR applications used
 85
 86variable "emr-app" {
 87  type = map(string)
 88  default = {
 89    hadoop-hdfs-namenode  = "50070"
 90    hadoop-hdfs-datanode  = "50075"
 91    hbase                 = "16010"
 92    hue                   = "8888"
 93    jupyterhub            = "9443"
 94    livy                  = "8998"
 95    spark                 = "18080"
 96    tez                   = "8080"
 97    yarn-node-manager     = "8042"
 98    yarn-resource-manager = "8088"
 99    zeppelin              = "8890"
100  }
101}
102
103# Import subnets
104
105data "aws_subnet_ids" "alb-subnets" {
106  vpc_id = var.vpc_id
107}
108
109# AWS Security Group
110resource "aws_security_group" "lb_sg01" {
111  name        = "${var.application}-${lower(var.environment)}-lb-sg01"
112  description = "Allow inbound traffic to the ${upper(var.application)} load balancer"
113  vpc_id      = var.vpc_id
114  ingress {
115    description = "LB"
116    from_port   = 443
117    to_port     = 443
118    protocol    = "tcp"
119    cidr_blocks = [var.cidr_block]
120  }
121
122  egress {
123    from_port   = 0
124    to_port     = 0
125    protocol    = "-1"
126    cidr_blocks = ["0.0.0.0/0"]
127  }
128
129  lifecycle {
130    create_before_destroy = true
131  }
132
133}
134
135# EMR Load Balancer
136
137resource "aws_lb" "emr_lb" {
138  name               = "${lower(var.application)}-${lower(var.environment)}-lb-${random_id.lb-rand.hex}"
139  load_balancer_type = "application"
140  subnets            = data.aws_subnet_ids.alb-subnets.ids
141  security_groups    = [aws_security_group.lb_sg01.id]
142  lifecycle {
143    ignore_changes = [
144      tags,
145      access_logs
146    ]
147  }
148  depends_on = [ aws_lb_target_group.emr-tg ]
149  tags = local.emr-tags
150}
151
152resource "aws_lb_listener" "emr-443" {
153  load_balancer_arn = aws_lb.emr_lb.arn
154  port              = 443
155  protocol          = "HTTPS"
156  ssl_policy        = var.elbsecpolicy
157  certificate_arn   = data.aws_acm_certificate.wildcard-cert.arn
158  default_action {
159    type             = "forward"
160    target_group_arn = data.aws_lb_target_group.emr-namenode.arn
161  }
162  depends_on = [ aws_lb_target_group.emr-tg ]
163}
164
165resource "aws_lb_listener_rule" "host_based_emr_routing" {
166  for_each     = var.emr-app
167  listener_arn = aws_lb_listener.emr-443.arn
168
169  action {
170    type             = "forward"
171    target_group_arn = aws_lb_target_group.emr-tg[each.key].arn
172  }
173
174  condition {
175    host_header {
176      values = ["${each.key}.${var.domain}"]
177    }
178  }
179}
180
181resource "aws_lb_target_group" "emr-tg" {
182  for_each    = var.emr-app
183  name        = "${each.key}-${random_id.lb-rand.hex}"
184  port        = each.value
185  target_type = "instance"
186  protocol    = "HTTP"
187  vpc_id      = var.vpc_id
188  tags        = local.emr-tags
189}
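
The target groups in the listing are created empty, so something still has to register the EMR instances with them. One possible sketch, assuming the EMR deployment can hand over the master node's EC2 instance ID; the variable emr_master_instance_id and the resource name emr-tg-attach are hypothetical and not part of the original code:

```hcl
# Hypothetical input: the EC2 instance ID of the EMR master node.
variable "emr_master_instance_id" {
  type        = string
  description = "EC2 instance ID of the EMR master node (assumed to come from the EMR deployment)"
}

# Register that instance in every target group, on the application's port.
resource "aws_lb_target_group_attachment" "emr-tg-attach" {
  for_each         = var.emr-app
  target_group_arn = aws_lb_target_group.emr-tg[each.key].arn
  target_id        = var.emr_master_instance_id
  port             = each.value
}
```

Which applications actually listen on the master node depends on the cluster; the DataNode and NodeManager interfaces, for instance, usually live on core or task nodes and would need those instances registered instead. The cluster's security groups also have to allow traffic from the load balancer's security group on these ports.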