AWS Certified Big Data: Specialty study outline
In another installment of study blueprints for AWS certification exams; I am happy to provide my suggested outline for what I used to pass the AWS Certified Big Data Specialty certification in December 2019.
Deep-dive re:Invent videos:
- AWS re:Invent 2018: Big Data Analytics Architectural Patterns & Best Practices (ANT201-R1)
- AWS re:Invent 2018: Effective Data Lakes: Challenges and Design Patterns (ANT316)
- AWS re:Invent 2018: High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R1)
- AWS re:Invent 2018: [REPEAT 1] A Deep Dive into What's New with Amazon EMR (ANT340-R1)
- Amazon Redshift Masterclass
- Big Data Analytics Options on AWS (January 2016) PDF
- Streaming Data Solutions on AWS with Amazon Kinesis (July 2017) PDF
- Data Warehousing on AWS (March 2016) PDF
- Best Practices for Amazon EMR (August 2013) PDF
- EMR security :
Data at rest
- Data residing on Amazon S3—S3 client-side encryption with EMR
- Data residing on disk—the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using Linux Unified Key System (LUKS)
Data in transit
- Data in transit from E,MR to S3, or vice versa—S3 client-side encryption with EMR
- Data in transit between nodes in a cluster—in-transit encryption via Secure Sockets Layer (SSL) for MapReduce and Simple Authentication and Security Layer (SASL) for Spark shuffle encryption
- Data being spilled to disk or cached during a shuffle phase—Spark shuffle encryption or LUKS encryption
EMR Consistent View: EMRFS consistent view is an optional feature available when using Amazon EMR release version 3.2.1 or later. Consistent view allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS. When you create a cluster with consistent view enabled, Amazon EMR uses an Amazon DynamoDB database to store object metadata and track consistency with Amazon S3. If consistent view determines that Amazon S3 is inconsistent during a file system operation, it retries that operation according to rules that you can define. By default, the DynamoDB database has 500 read capacity and 100 write capacity. You can configure read/write capacity settings depending on the number of objects that EMRFS tracks and the number of nodes concurrently using the metadata.
Use Apache Zeppelin as a notebook for interactive data exploration.Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results.
The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your cluster, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. Ganglia is also configured to ingest and visualize Hadoop and Spark metrics.
Apache Cassandra and Apache HBase are columnar Database. Compare to Dynamo DB, Apache HBase is much more flexible in terms of what you can store (size and data type wise). Apache HBase gives you the option to have very flexible row key data types, whereas DynamoDB only allows scalar types for the primary key attributes. DynamoDB, on the other hand, provides very easy creation and maintenance of secondary indexes, something that you have to do manually in Apache HBase.
In EMR, The default input format for a cluster is text files with each line separated by a newline (\n) character, which is the input format most commonly used. If your input data is in a format other than the default text files, you can use the Hadoop interface toInputFormat specify other input types.
If you are using Hive, you can use a serializer/deserializer (SerDe) to read data in from a given format into HDFS.
Hue (Hadoop User Experience) is an open-source, web-based, graphical user interface for use with Amazon EMR and Apache Hadoop. Hue groups together several different Hadoop ecosystem projects into a configurable interface. Hue acts as a front-end for applications that run on your cluster, allowing you to interact with applications using an interface that may be more familiar or user-friendly. The applications in Hue, such as the Hive and Pig editors, replace the need to log in to the cluster to run scripts interactively using each application's respective shell. After a cluster launches, you might interact entirely with applications using Hue or a similar interface.
Amazon EMR supports Apache Mahout, a machine learning framework for Apache Hadoop. Mahout is a machine learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build item recommendations for users. Mahout employs the Hadoop framework to distribute calculations across a cluster, and now includes additional work distribution methods, including Spark.
You can run your EMR cluster as a transient process: one that launches the cluster, loads the input data, processes the data, stores the output results, and then automatically shuts down. This is the standard model for a cluster that performs a periodic processing task. Shutting down the cluster automatically ensures that you are only billed for the time required to process your data.
Spark has micro-batching but can guarantee only-once-delivery if configured. Spark's four modules are MLlib, SparkSQL, Spark Streaming, and GraphX.
When your cluster runs, Hadoop creates a number of map and reduce tasks. These determine the number of tasks that can run simultaneously during your cluster. Run too few tasks and you have nodes sitting idle; run too many and there is significant framework overhead.
If you want that data to be encrypted in-transit between nodes, then Hadoop encrypted shuffle has to be setup. Encrypted Shuffle capability allows encryption of the MapReduce shuffle using HTTPS. When you select the in-transit encryption checkbox in the EMR security configuration, Hadoop Encrypted Shuffle is automatically setup for you upon cluster launch
Do not use Spark for batch processing. With Spark, there is minimal disk I/O, and the data being queried from the multiple data stores needs to fit into memory. Queries that require a lot of memory can fail. For large batch jobs, consider Hive. Also avoid using Spark for multi-user reporting with many concurrent requests.
When you create a table, you designate one of three distribution styles; EVEN, KEY, or ALL.
The leader node distributes the rows across the slices in a round-robin fashion , regardless of the values in any particular column. **EVEN distribution is appropriate when a table does not participate in joins ** or when there is not a clear choice between KEY distribution and ALL distribution. EVEN distribution is the default distribution style.
Key distribution : The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.
**ALL distribution : A copy of the entire table is distributed to every node. ** Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in.
ALL distribution multiplies the storage required by the number of nodes in the cluster, and so it takes much longer to load, update, or insert data into multiple tables. ALL distribution is appropriate only for relatively slow-moving tables; that is, tables that are not updated frequently or extensively. Small dimension tables do not benefit significantly from ALL distribution, because the cost of redistribution is low.
RedShift UNLOAD: Unloads the result of a query to one or more files on S3, using Amazon S3 server-side encryption (SSE-S3). You can also specify server-side encryption with an AWS Key Management Service key (SSE-KMS) or client-side encryption with a customer-managed key (CSE-CMK).You can manage the size of files on Amazon S3, and, by extension, the number of files, by setting the MAXFILESIZE parameter.
Number of the slice in Redshift Node
o ds2.xlarge , dc1.large, dc2.large = 2
o ds2.8xlarge, dc2.8xlarge = 16
o dc1.8xlarge = 32
For Redshift COPY command, Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster.
To validate the data in the Amazon S3 input files or Amazon DynamoDB table before you actually load the data in Redshift, use the NOLOAD option with the COPY command. Use NOLOAD with the same COPY commands you use to actually load the data. NOLOAD checks the integrity of all of the data without loading it into the database. The NOLOAD option displays any errors ins same manner as that would occur if you had attempted to load the data.
There are two types of snapshots: automated and manual. Amazon Redshift stores these snapshots internally in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection. If you need to restore from a snapshot, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify. Read more - https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html
You can use a manifest to ensure that the COPY command loads all of the required files, and only the required files, for a data load. Instead of supplying an object path for the COPY command, you supply the name of a JSON-formatted text file that explicitly lists the files to be loaded. The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix. You can use a manifest to load files from different buckets or files that do not share the same prefix.
RedshiftCopyActivity : Copies data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or easily merge data into an existing table. In addition, RedshiftCopyActivity let's you work with an S3DataNode, since it supports a manifest file.
In Redshift, You can efficiently add new data to an existing table by using a combination of updates and inserts from a staging table. While Amazon Redshift does not support a single merge, or upsert, command to update a table from a single data source, you can perform a merge operation by creating a staging table
Redshift cluster resizing - https://docs.aws.amazon.com/redshift/latest/mgmt/rs-resize-tutorial.html
EXPLAIN command Displays the execution plan for a query statement without running the query in Redshift
By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
In Redshift , If you enable SQA, you can reduce or eliminate workload management (WLM) queues that are dedicated to running short queries. In addition, long-running queries don't need to contend with short queries for slots in a queue, so you can configure your WLM queues to use fewer query slots. When you use lower concurrency, query throughput is increased and overall system performance is improved for most workloads.
In RedShift , Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries. SQA executes short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries. With SQA, short-running queries begin running more quickly and users see results sooner.
In Redshift The CASE expression is a conditional expression, similar to if/then/else statements found in other languages. CASE is used to specify a result when there are multiple conditions.
In Redshift , You enable encryption when you launch a cluster. To migrate from an unencrypted cluster to an encrypted cluster, you first unload your data from the existing, source cluster. Then you reload the data in a new, target cluster with the chosen encryption setting. During the migration process, your source cluster is available for read-only queries until the last step. The last step is to rename the target and source clusters, which switches endpoints so all traffic is routed to the new, target cluster
Amazon Redshift logs information about connections and user activities in your database. These logs help you to monitor the database for security and troubleshooting purposes, for database auditing. The logs are stored in the S3 buckets for convenient access with data security features for users who are responsible for monitoring activities in the database.
S3, DynamoDB, and EMR/EC2 instances directly integrate with Redshift using the COPY command.
The UNLOAD ENCRYPTED command automatically stores the data encrypted using client side encryption and uses HTTPS to encrypt the data during the transfer to S3.If you want to ensure files are automatically encrypted on S3 with server-side encryption, no special action is needed. The unload command automatically creates files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE-S3).
For RedShift data , bzip2 compression algorithm has the highest compression ratio. ** **
To manually migrate an Amazon Redshift cluster to another AWS account , follow these steps:
Create a manual snapshot of the cluster you want to migrate.
Share the cluster snapshot with another AWS account to view and restore the snapshot.
Before you copy a snapshot to another region, first enable cross-region snapshots.
In the destination AWS account, restore the shared cluster snapshot.
The unit of data stored by Kinesis Data Streams is a data record. A stream represents a group of data records. The data records in a stream are distributed into shards.
A shard has a sequence of data records in a stream. When you create a stream, you specify the number of shards for the stream. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second. Shards also support up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The total capacity of a stream is the sum of the capacities of its shards. You can increase or decrease the number of shards in a stream as needed. However, you are charged on a per-shard basis.
If you have sensitive data, you can enable server-side data encryption when you use Amazon Kinesis Data Firehose. However, this is only possible if you use a Kinesis stream as your data source. When you configure a Kinesis stream as the data source of a Kinesis Data Firehose delivery stream, Kinesis Data Firehose no longer stores the data at rest. Instead, the data is stored in the Kinesis stream.
When you send data from your data producers to your Kinesis stream, the Kinesis Data Streams service encrypts your data using an AWS KMS key before storing it at rest. When your Kinesis Data Firehose delivery stream reads the data from your Kinesis stream, the Kinesis Data Streams service first decrypts the data and then sends it to Kinesis Data Firehose. Kinesis Data Firehose buffers the data in memory based on the buffering hints that you specify and then delivers it to your destinations without storing the unencrypted data at rest.
In Kinesis , To prevent skipped records, handle all exceptions within processRecords appropriately.
For each Amazon Kinesis Data Streams application, the KCL uses a unique Amazon DynamoDB table to keep track of the application's state. Because the KCL uses the name of the Amazon Kinesis Data Streams application to create the name of the table, each application name must be unique.
If your Amazon Kinesis Data Streams application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table. The KCL creates the table with a provisioned throughput of 10 reads per second and 10 writes per second, but this might not be sufficient for your application. For example, if your Amazon Kinesis Data Streams application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.
** ** PutRecord returns the shard ID of where the data record was placed and the sequence number that was assigned to the data record. Sequence numbers increase over time and are specific to a shard within a stream, not across all shards within a stream. To guarantee strictly increasing ordering, write serially to a shard and use the SequenceNumberForOrdering parameter.
Details regarding Kinesis records :
For live streaming Kinesis gets ruled out if record size greater than 1 MB , in that case Kafka can support bigger records.
You can trigger One lambda per shard. If you want to use Lambda with Kinesis Streams, you need to create Lambda functions to automatically read batches of records off your Amazon Kinesis stream and process them if records are detected on the stream. AWS Lambda then polls the stream periodically (once per second) for new records.
In Kinesis stream ,the PutRecordBatch() operation can take up to 500 records per call or 4 MB per call, whichever is smaller. Buffer size ranges from 1 MB to 128 MB.
In circumstances where data delivery to the destination is falling behind data ingestion into the delivery stream, Amazon Kinesis Firehose raises the buffer size automatically to catch up and make sure that all data is delivered to the destination.
If data delivery to Redshift fail from Kinesis Firehose , Amazon Kinesis Firehose retries data delivery every 5 minutes for up to a maximum period of 60 minutes. After 60 minutes, Amazon Kinesis Firehose skips the current batch of S3 objects that are ready for COPY and moves on to the next batch. The information about the skipped objects is delivered to your S3 bucket as a manifest file in the errors folder, which you can use for manual backfill. For information about how to COPY data manually with manifest files, see Using a Manifest to Specify Data Files.
If data delivery to your Amazon S3 bucket fails , Amazon Kinesis Firehose retries to deliver data every 5 seconds for up to a maximum period of 24 hours. If the issue continues beyond the 24-hour maximum retention period, it discards the data.
Aggregation refers to the storage of multiple records in a Streams record. Aggregation allows customers to increase the number of records sent per API call, which effectively increases producer throughput._ Aggregation Storing multiple records within a single Kinesis Data Streams record while Collection_ using the API operation PutRecords to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream.You can first aggregate stream record and then send them to stream using collection putrecords() in multiple shard.
** ** Spark Streaming uses the Kinesis Client Library (KCL) to consume data from a Kinesis stream. KCL handles complex tasks like load balancing, failure recovery, and check-pointing
The Amazon Kinesis Connector Library helps Java developers integrate Amazon Kinesis with other AWS and non-AWS services. The current version of the library provides connectors for Amazon DynamoDB, Amazon Redshift, Amazon S3, Elasticsearch.
- Amazon ML Key component:
o Datasources contain metadata associated with data inputs to Amazon ML
o ML Models generate predictions using the patterns extracted from the input data
o Evaluations measure the quality of ML models
o Batch Predictions asynchronously generate predictions for multiple input data observations
o Real-time Predictions synchronously generate predictions for individual data observations
Amazon ML supports three types of ML models: binary classification, multiclass (Categorial) classification, and regression. - https://docs.aws.amazon.com/machine-learning/latest/dg/types-of-ml-models.html
AUCArea Under the Curve (AUC) measures the ability of a binary ML model to predict a higher score for positive examples as compared to negative examples.Macro-averaged F1-scoreThe macro-averaged F1-score is used to evaluate the predictive performance of multiclass ML models.RMSEThe Root Mean Square Error (RMSE) is a metric used to evaluate the predictive performance of regression ML models.
A lower Area Under Curve (AUC) reduces accuracy of the prediction; AUC values well below 0.5 may indicate a problem with the data. The F1 score's range is 0 to 1. A larger value indicates better predictive accuracy.
Amazon ML is limited to 100 'categorical' recommendations . Amazon ML can support up to 100 GB of data.
You must upload your input data to S3 because Amazon ML reads data from Amazon S3 locations. You can upload your data directly to Amazon S3, or Amazon ML can copy data that you've stored in Redshift or RDS into a .csv file and upload it to S3.
ElasticSearch is suitable to analyze large set of streaming data from Kinesis.
Elastic Search Snapshots are backups of a cluster's data and state. They provide a convenient way to migrate data across Amazon Elasticsearch Service domains and recover from failure. The service supports restoring from snapshots taken on both Amazon ES domains and self-managed Elasticsearch clusters.
Amazon ES takes daily automated snapshots of the primary index shards in a domain, as described in Configuring Automatic Snapshots. The service stores up to 14 of these snapshots for no more than 30 days in a preconfigured Amazon S3 bucket at no additional charge to you. You can use these snapshots to restore the domain.
IoT and Others
For AWS IoT - Devices connect using your choice of identity (X.509 certificates, IAM users and groups, Amazon Cognito identities, or custom authentication tokens) over a secure connection according to the AWS IoT connection model. Note that KMS is not in list.
AWS IoT rule actions specify what to do when a rule is triggered. You can create actions for the following services: CloudWatch, DynamoDB, Elasticsearch, Kinesis Firehose, Kinesis Streams, Kinesis Firehose, Lambda, S3, SNS, and SQS. Not for any relational store like Aurora and Redshift.
The Rules Engine transforms messages using a SQL-based syntax. The Device Gateway (Message Broker) uses topics to route messages from publishing clients to subscribing clients. Then a message is published to the topic, the SQL statement is evaluated and rule action is triggered, sending the message to another AWS service.
In AWS Data Pipeline, a precondition is a pipeline component containing conditional statements that must be true before an activity can run. For example, a precondition can check whether source data is present before a pipeline activity attempts to copy it. AWS Data Pipeline provides several pre-packaged preconditions that accommodate common scenarios, such as whether a database table exists, whether an Amazon S3 key is present, and so on. However, preconditions are extensible and allow you to run your own custom scripts to support endless combinations.
AWS Data Pipeline supports A JDBC database , A RDS Database and Redshift . computational resource that perform the work on EC2 and EMR cluster.
Data Pipeline integrate with on-premise servers. AWS provides you with a Task Runner package that you install on your on-premise hosts. Once installed, the package polls Data Pipeline for work to perform. If it detects that an activity needs to run on your on-premise host (based on the schedule in Data Pipeline), the Task Runner will issue the appropriate command to run the activity, which can be running a stored procedure or a database dump or another database activity.
A QuickSight Dashboard is a read-only snapshot of an analysis. You can share it with other Quicksight users for reporting. The data in the dashboard reflects the data set that is used by the analysis. If you share a dashboard with users, they can then view and filter the dashboard data, but they cannot save any filters applied to the dashboard.
Number of Dynamo DB partition = readCapacityUnit/3000 + WriteCapacityUnit/1000. 10 GB data per partition after that it create new partition.
Details on GSI Vs LSI in Dynamo DB - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html
COPY command used for Loading Data from DynamoDB Into Amazon Redshift.
DAX (DynamoDB Accelerator)is a DynamoDB-compatible caching service that enables you to benefit from fast in-memory performance for demanding application.
A VPC endpoint for DynamoDB and S3 enables Amazon EC2 instances in your VPC to use their private IP addresses to access DynamoDB/S3 with no exposure to the public Internet.
DynamoDB Streams supports the following stream record views:
- KEYS_ONLY—Only the key attributes of the modified item
- NEW_IMAGE—The entire item, as it appears after it was modified
- OLD_IMAGE—The entire item, as it appears before it was modified
- NEW_AND_OLD_IMAGES—Both the new and the old images of the item
DynamoDB supports many different data types for attributes within a table. They can be categorized as follows:
o Scalar Types – A scalar type can represent exactly one value. The scalar types are number, string, binary, Boolean, and null.
o Document Types – A document type can represent a complex structure with nested attributes—such as you would find in a JSON document. The document types are list and map.
o Set Types – A set type can represent multiple scalar values. The set types are string set, number set, and binary set.