Amazon EMR, ALB & Me.

Amazon EMR is AWS's managed big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. To secure traffic to the EMR web interfaces with Transport Layer Security (TLS), this post places an AWS Application Load Balancer in front of them.

My preferred way to deploy Amazon EMR (covered in another post), along with the Application Load Balancer and any dependencies, is Terraform. The Terraform code below was tested and deployed using:

Terraform v0.12.29
+ provider.aws v3.15.0
+ provider.random v3.0.0
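
To keep later runs on the same versions, the constraints can be pinned in the configuration itself. This is a minimal sketch and not part of the original code; the version ranges are my assumption of reasonable bounds around the versions listed above, so tighten or loosen them as needed.

```hcl
# Hypothetical version pinning, not in the original listing.
# Uses the plain version-string form of required_providers that Terraform 0.12 accepts.
terraform {
  required_version = "~> 0.12.29"

  required_providers {
    aws    = "~> 3.15"
    random = "~> 3.0"
  }
}
```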

I’ve deployed EMR with the following applications and ports:

  • hadoop-hdfs-namenode - 50070
  • hadoop-hdfs-datanode - 50075
  • hbase - 16010
  • hue - 8888
  • jupyterhub - 9443
  • livy - 8998
  • spark - 18080
  • tez - 8080
  • yarn-node-manager - 8042
  • yarn-resource-manager - 8088
  • zeppelin - 8890

Prerequisites

  1. Terraform v0.12.29
  2. An AWS account in good standing
  3. A certificate issued by AWS Certificate Manager (ACM) that covers the DNS namespace used for EMR - see line 6 of the Terraform code

Manifest

  1. AWS Application Load Balancer
  2. AWS security group for ingress traffic for AWS Application Load Balancer
  3. AWS Application Load Balancer listener (HTTPS)
  4. AWS Application Load Balancer listener rules (one per application listed above)
  5. AWS Application Load Balancer target groups (one per application listed above)

Terraform AWS Provider

  1. AWS Application Load Balancer (aws_lb)
  2. AWS security group for ingress traffic for AWS Application Load Balancer (aws_security_group)
  3. AWS Application Load Balancer listener (HTTPS) (aws_lb_listener)
  4. AWS Application Load Balancer listener rules (aws_lb_listener_rule)
  5. AWS Application Load Balancer target groups (aws_lb_target_group)

Variables

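Set the variables as follows.

  1. Retrieve the code below and adapt it as you see fit.
  2. Set the domain of your existing AWS ACM certificate in line #7, for example domain = "*.troydieter.com"
  3. Set the top-level domain used for EMR in line #13, for example default = "emr.troydieter.com". The listener rules match hostnames of the form <application>.<domain> (for example hue.emr.troydieter.com), and a wildcard certificate covers only a single label, so check that the certificate from line #7 covers those names (*.troydieter.com alone does not match hue.emr.troydieter.com).
  4. Set the aws-profile variable (line #50) to the credentials profile to deploy with. It has no default, so Terraform prompts for a value if none is supplied.
  5. Set the aws_region variable in line #55 if you are not deploying to us-east-1
  6. Set the environment variable in line #61 accordingly
  7. Set the vpc_id variable in line #67 to the VPC the load balancer deploys to; it also has no default.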
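
Instead of editing the defaults in place, the same values can be supplied from a variables file. The sketch below is an illustration only: every value is a placeholder I made up (the profile name, the VPC ID, and the narrower CIDR range), while the variable names are the ones declared in the code. The ACM certificate domain is hard-coded in the data source, so it still has to be changed in line #7.

```hcl
# terraform.tfvars (hypothetical values, replace with your own)
aws-profile = "my-profile"            # named profile from your AWS credentials file
aws_region  = "us-east-1"
environment = "dev"
domain      = "emr.example.com"       # applications are served as <application>.<domain>
vpc_id      = "vpc-0123456789abcdef0" # placeholder VPC ID
cidr_block  = "10.0.0.0/8"            # narrower than the 0.0.0.0/0 default
```

A file named terraform.tfvars in the working directory is loaded automatically by terraform plan and terraform apply.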

Apply

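  1. Run terraform init to download the aws and random providers
  2. Run terraform plan and review the output to confirm the variables are set correctly
  3. Run terraform apply to create the resources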

Terraform Code

# EMR Load Balancer Creation
# www.troydieter.com

# Certificate and domain

data "aws_acm_certificate" "wildcard-cert" {
  domain   = "*.example.com"
  statuses = ["ISSUED"]
}

variable "domain" {
  type        = string
  default     = "emr.example.com"
  description = "The top level domain used for EMR"
}

resource "random_id" "lb-rand" {
  byte_length = 2
}

provider "aws" {
  profile = var.aws-profile
  region  = var.aws_region
}

# Tags

locals {
  emr-tags = {
    "parent_app"  = var.application
    "environment" = var.environment
  }
}

# Data sources
# Used for the default target group, send traffic to the NameNode

data "aws_lb_target_group" "emr-namenode" {
  name       = "hadoop-hdfs-namenode-${random_id.lb-rand.hex}"
  depends_on = [aws_lb_target_group.emr-tg]
}

# Variables 

variable "application" {
  type    = string
  default = "EMR"
}

variable "aws-profile" {
  type        = string
  description = "AWS Profile used to deploy with"
}

variable "aws_region" {
  type        = string
  default     = "us-east-1"
  description = "Region"
}

variable "environment" {
  type        = string
  default     = "dev"
  description = "Environment you're deploying with"
}

variable "vpc_id" {
  type        = string
  description = "The VPC ID that the load balancer deploys to"
}

variable "cidr_block" {
  type        = string
  default     = "0.0.0.0/0"
  description = "CIDR Block of allowed ingress traffic"
}

variable "elbsecpolicy" {
  type        = string
  default     = "ELBSecurityPolicy-TLS-1-1-2017-01"
  description = "Applied AWS ELB security policy (this one permits TLS 1.1 and later)"
}

# Example map of AWS EMR applications and their ports

variable "emr-app" {
  type = map(string)
  default = {
    hadoop-hdfs-namenode  = "50070"
    hadoop-hdfs-datanode  = "50075"
    hbase                 = "16010"
    hue                   = "8888"
    jupyterhub            = "9443"
    livy                  = "8998"
    spark                 = "18080"
    tez                   = "8080"
    yarn-node-manager     = "8042"
    yarn-resource-manager = "8088"
    zeppelin              = "8890"
  }
}

# Import subnets

data "aws_subnet_ids" "alb-subnets" {
  vpc_id = var.vpc_id
}

# AWS Security Group
resource "aws_security_group" "lb_sg01" {
  name        = "${var.application}-${lower(var.environment)}-lb-sg01"
  description = "Allow inbound traffic to the ${upper(var.application)} load balancer"
  vpc_id      = var.vpc_id
  ingress {
    description = "LB"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.cidr_block]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  lifecycle {
    create_before_destroy = true
  }

}

# EMR Load Balancer

resource "aws_lb" "emr_lb" {
  name               = "${lower(var.application)}-${lower(var.environment)}-lb-${random_id.lb-rand.hex}"
  load_balancer_type = "application"
  subnets            = data.aws_subnet_ids.alb-subnets.ids
  security_groups    = [aws_security_group.lb_sg01.id]
  lifecycle {
    ignore_changes = [
      tags,
      access_logs
    ]
  }
  depends_on = [aws_lb_target_group.emr-tg]
  tags       = local.emr-tags
}

resource "aws_lb_listener" "emr-443" {
  load_balancer_arn = aws_lb.emr_lb.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = var.elbsecpolicy
  certificate_arn   = data.aws_acm_certificate.wildcard-cert.arn
  default_action {
    type             = "forward"
    target_group_arn = data.aws_lb_target_group.emr-namenode.arn
  }
  depends_on = [aws_lb_target_group.emr-tg]
}

resource "aws_lb_listener_rule" "host_based_emr_routing" {
  for_each     = var.emr-app
  listener_arn = aws_lb_listener.emr-443.arn

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.emr-tg[each.key].arn
  }

  condition {
    host_header {
      values = ["${each.key}.${var.domain}"]
    }
  }
}

resource "aws_lb_target_group" "emr-tg" {
  for_each    = var.emr-app
  name        = "${each.key}-${random_id.lb-rand.hex}"
  port        = each.value
  target_type = "instance"
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  tags        = local.emr-tags
}
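
After an apply, it helps to see which hostnames the listener rules answer on. The following is a minimal sketch that could sit in the same configuration; the output names emr_lb_dns_name and emr_app_urls are my own and not part of the original code. Each hostname still needs a DNS record pointing at the load balancer, which this post does not create.

```hcl
# Hypothetical outputs, not in the original listing.

# DNS name of the load balancer, the target for the application DNS records.
output "emr_lb_dns_name" {
  description = "DNS name of the EMR load balancer"
  value       = aws_lb.emr_lb.dns_name
}

# One HTTPS URL per application key in the emr-app map,
# built the same way as the host_header values in the listener rules.
output "emr_app_urls" {
  description = "HTTPS URL for each EMR application"
  value       = { for app in keys(var.emr-app) : app => "https://${app}.${var.domain}" }
}
```

With the default domain, terraform output emr_app_urls would list entries such as hue mapped to https://hue.emr.example.com.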