Expand your search using AWS native services to identify, comprehend and securely store documents.

The document debacle

Companies continue to fight an age-old problem: paper documents. Modernizing those documents so that HIPAA-regulated and PII data can be searched, cataloged and protected is paramount. In this article, we will cover how you can build a serverless pipeline within AWS to tackle the document debacle!

Scenario

In the solution architecture below, data is securely migrated from an on-premises data center to the AWS Cloud. Networking components such as AWS Direct Connect ensure that the data traverses the network securely on the way to its destination. We assume the data is in a report-style format: raw text, Adobe PDF or images (.PNG, .JPG). The solution can be run once as a forklift migration of the data, or as a replication system that moves data in batches over time.

End users do not need to be concerned with how the data is converted, as the serverless pipeline handles all of the ETL (extract, transform and load). Elasticsearch, when paired with Kibana, offers an immensely powerful tool for searching large datasets. It is built on the Apache Lucene engine and is well suited to indexing and searching large document collections.
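To make the transform and load steps a little more concrete, here is a minimal sketch in Python using only the standard library. It assumes the text has already been extracted from a document, and it builds the newline-delimited body that the Elasticsearch `_bulk` API expects. The index name `documents`, the field names `source_key`, `content` and `indexed_at`, and both helper functions are hypothetical; they are illustrative choices, not details of the pipeline described in this article.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List

INDEX_NAME = "documents"  # hypothetical index name, chosen for illustration


def to_document(source_key: str, text: str) -> Dict[str, str]:
    """Transform: turn extracted text into a flat, searchable document."""
    return {
        "source_key": source_key,
        # Collapse the stray whitespace that text extraction tends to leave behind.
        "content": " ".join(text.split()),
        "indexed_at": datetime.now(timezone.utc).isoformat(),
    }


def to_bulk_body(docs: List[Dict[str, str]]) -> str:
    """Load: build the newline-delimited JSON body for the _bulk API."""
    lines = []
    for doc in docs:
        # A deterministic id means re-processing a file overwrites its
        # existing entry instead of creating a duplicate.
        doc_id = hashlib.sha256(doc["source_key"].encode("utf-8")).hexdigest()
        lines.append(json.dumps({"index": {"_index": INDEX_NAME, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent as the body of a POST to the domain's `_bulk` endpoint with a `Content-Type` of `application/x-ndjson`. How that request is authenticated depends on how the domain is secured.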

Solution overview

[Figure: DocSearch Solution Overview]

Components

  1. Data resides on-premises and is in a format supported for conversion. AWS DataSync is deployed on a conventional operating system and is used to export the data securely.
  2. Data traverses an AWS Direct Connect connection so that transit remains private and never crosses the public internet.
  3. The VPC endpoint is the ingress point to the VPC, facilitating the secure path.
  4. The AWS DataSync service is configured, with agents running in private subnets within the VPC. The DataSync agent receives the data and processes it. In this case, the data is sent to the destination Amazon S3 bucket for processing.
  5. Data is sent privately within the VPC to the Amazon S3 bucket. An Amazon S3 endpoint is used to ensure traffic does not leave the VPC. Encryption in transit is enforced by the Amazon S3 bucket policy, while the stored objects are encrypted at rest using AWS KMS.
  6. One or more AWS Lambda functions run to process, in batches, the data that has landed in the Amazon S3 bucket. Multiple AWS components facilitate the analysis of the data.
  7. One or more AWS Lambda functions run to extract the data, in batches, passed on from the previous function. Multiple AWS components facilitate the extraction of the data.
  8. Amazon Elasticsearch Service stores the extracted data in an index that is encrypted at rest and in transit. The data can now be searched internally using the Elasticsearch API or Kibana. Amazon Cognito secures the login process and can integrate with SSO if required.
  9. Kibana sits on top of Elasticsearch and provides user-friendly search expressions, dashboards and tools.
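Step 5 relies on the bucket policy to require encryption in transit. As a rough sketch of what such a policy could look like, the Python below builds a policy document that denies any request not made over TLS, using the `aws:SecureTransport` condition key. The function name and the example bucket name are hypothetical, and the policy used in the actual solution may contain additional statements.

```python
import json


def build_tls_only_bucket_policy(bucket_name: str) -> dict:
    """Return a bucket policy that rejects requests not sent over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object inside it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" when the request is plain HTTP.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    # "example-docsearch-bucket" is a placeholder, not a real bucket.
    print(json.dumps(build_tls_only_bucket_policy("example-docsearch-bucket"), indent=2))
```

The printed JSON could be attached to the bucket through the console, the CLI or infrastructure as code. Encryption at rest with AWS KMS is a separate setting and is not covered by this statement.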

🔍 Employees can now retrieve records & documents much more easily, while using a single interface!
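For readers who would rather use the Elasticsearch API than Kibana, a search like this can be expressed in the Elasticsearch Query DSL. The sketch below builds a simple full-text request body in Python. It assumes the indexed documents carry `content`, `source_key` and `indexed_at` fields, which are hypothetical names chosen for illustration, as is the helper function itself.

```python
import json


def build_search_request(terms: str, size: int = 10) -> dict:
    """Build a full-text search body that requires every term to match."""
    return {
        "size": size,
        "query": {
            "match": {
                # "operator": "and" narrows results to documents that
                # contain all of the search terms, not just any of them.
                "content": {"query": terms, "operator": "and"}
            }
        },
        # Return matching snippets so users can see why a document matched.
        "highlight": {"fields": {"content": {}}},
        # Only return the metadata fields, not the full document text.
        "_source": ["source_key", "indexed_at"],
    }


if __name__ == "__main__":
    print(json.dumps(build_search_request("discharge summary"), indent=2))
```

The body would be sent to the index's `_search` endpoint. Kibana builds comparable queries behind the scenes when a user types into its search bar.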
