
Elasticsearch to AWS S3 Data Migration: A Glance!

Written By: Chakravarthy Varaga

Enterprises today need to monetize their data, or at the very least leverage insights from the enormous amounts of data they hold to generate new revenue from new products. Building the infrastructure and platforms to gather, clean, prepare, process, standardize and make that (big) data available to heterogeneous stakeholders, in an ecosystem that is vast and constantly changing, is a daunting task. Open-source communities keep churning out new components and frameworks in this space, and cloud providers such as AWS, Azure and GCP release new managed services that let adopting organizations focus on their business rather than on building and operating those services. This article focuses on data migration in the cloud and how flexible it can be using managed services.

Requirements

  • A billion+ JSON documents from each environment (integration, performance, production) to be moved from Elasticsearch to AWS S3.
  • The JSON documents need to be flattened.
  • The output must be stored in Parquet format.
  • There are no latency restrictions.

There are multiple solutions to this problem. Some of them are listed below:

Elasticsearch → Logstash → Apache Kafka → Apache Spark → S3

Elasticsearch → Logstash → Apache Kafka → Secor → S3

Kafka Connect can be used as another integration option instead of Spark or Secor; however, there is no officially supported Parquet sink converter/connector that can store data in S3.

Secor is an open-source component from Pinterest that runs as a service ingesting data from Kafka and provides out-of-the-box configuration options to tweak caching, data formats, partitioning and so on.

With both options, standing up and operating the infrastructure for Kafka, Spark and Secor incurs costs even if the migration is a one-off. Code has to be written to wire these components together, and that effort has its own cost. After all, the infrastructure needs to be torn down once it is no longer used.

With Apache Spark, the programmatic control, backed by its distributed processing power, is a boon: it gives fine-grained control over partitioning, ordering, caching and so on. However, the developer effort involved, including CI, is unavoidable.

Researching some of the AWS managed services revealed a far more seamless and effortless route for this one-off activity:

Elasticsearch → Logstash → AWS Kinesis Data Streams → AWS Kinesis Data Firehose (Lambda, Glue) → S3

Logstash

Logstash is a product from Elastic that lets you stash data into and out of Elasticsearch. Its plugin-based programming model makes it easy to configure inputs and outputs, so you can move data to other streaming and messaging systems such as Kafka or Kinesis, NoSQL databases and so on. The configuration snippet is further down in this article.

Kinesis Firehose

Kinesis Data Firehose is a managed delivery stream that lets you capture and transform data from streams, convert it and store it in a destination. It has a source, a processing/transformation stage and a destination. In this case the source is the Kinesis stream created earlier (into which the JSON data from Elasticsearch is ingested through Logstash). Firehose can batch, compress and encrypt data before storing it, and it provides the facilities to run the whole pipeline: Data Source → Transform → Data Conversion → Store.
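As a rough illustration, the same delivery stream can also be wired up programmatically instead of through the console. The sketch below uses boto3; every ARN, bucket, stream, database and function name is a placeholder rather than a value from the original setup.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # placeholder region

firehose.create_delivery_stream(
    DeliveryStreamName="es-to-s3-firehose",          # placeholder name
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/es-migration-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::es-migration-bucket",
        "Prefix": "documents/",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        # Transform step: the flattening Lambda described in the next section.
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:us-east-1:123456789012:function:flatten-json",
                }],
            }],
        },
        # Data conversion step: JSON in, Parquet out, using a schema registered in Glue.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "DatabaseName": "es_migration",   # placeholder Glue database
                "TableName": "documents",         # placeholder Glue table
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            },
        },
    },
)
```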

Lambda (Transform)

Lambda functions are serverless, managed compute. You write just the code you want to execute without worrying about deployment, management, operations or provisioning of servers. The processing step in the Firehose pipeline (flattening the JSON documents) is run by a Lambda. Python, Node.js, Ruby and .NET are widely used runtimes, and Java is supported as well. Creating a Lambda from the console is simple too: choose the runtime (Python in our case), write the code (below) and fire away.
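A minimal sketch of what such a flattening Lambda might look like, assuming the standard Firehose transformation event shape (a batch of base64-encoded records, each returned with the same recordId); the function and field names are illustrative, not the original code.

```python
import base64
import json


def flatten(doc, parent_key="", sep="_"):
    """Recursively flatten nested JSON objects into a single-level dict."""
    items = {}
    for key, value in doc.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items


def lambda_handler(event, context):
    # Firehose passes a batch of base64-encoded records; each must come back
    # with the same recordId and a result of 'Ok', 'Dropped' or 'ProcessingFailed'.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        flattened = flatten(payload)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(flattened) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```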

AWS Glue (Data conversion)

AWS Glue is an ETL service from AWS that includes metadata management (table definitions, schemas) in what is called the Data Catalog. The transformed data from the Lambda needs to be converted to Parquet format, and Firehose supports out-of-the-box SerDes to convert to Parquet or ORC. The conversion needs a schema to conform to, and AWS Glue can be used to create the database and the schema through the console.
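The database and table can also be defined programmatically. The sketch below is an illustrative boto3 call with placeholder names and a toy two-column schema, not the schema used in the actual migration.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# Database that will hold the table definition referenced by Firehose.
glue.create_database(DatabaseInput={"Name": "es_migration"})

# Table whose columns describe the flattened JSON; Firehose uses this
# schema when serializing records to Parquet.
glue.create_table(
    DatabaseName="es_migration",
    TableInput={
        "Name": "documents",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "string"},            # illustrative columns only
                {"Name": "created_at", "Type": "timestamp"},
            ],
            "Location": "s3://es-migration-bucket/documents/",
        },
    },
)
```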

S3 (Store)

S3 (Simple Storage Service) is an object store with eleven nines (99.999999999%) of durability. Configuring it is as simple as specifying the bucket (root directory), the keys (directory-like paths) where the data has to be stored, buffering and so on. Now that the pipeline is created, it is time to write the Logstash configuration to move the data. There is no officially supported Logstash Kinesis output plugin, but there is an open-source plugin that can be used. A sample Logstash configuration using this plugin is below:
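A minimal sketch of such a configuration, assuming the community logstash-output-kinesis plugin is installed; the Elasticsearch endpoint, index pattern, stream name and region are placeholders rather than the values used in the original migration.

```
input {
  elasticsearch {
    hosts  => ["http://localhost:9200"]          # placeholder Elasticsearch endpoint
    index  => "documents-*"                      # placeholder index pattern
    query  => '{ "query": { "match_all": {} } }' # pull every document
    size   => 1000                               # scroll batch size
    scroll => "5m"
    docinfo => true
  }
}

output {
  kinesis {
    stream_name => "es-migration-stream"         # placeholder Kinesis stream
    region      => "us-east-1"                   # placeholder region
  }
}
```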

Monitoring/Diagnosis

Every service in this pipeline publishes metrics to Amazon CloudWatch, so the migration can be watched from a single place: the Kinesis stream (incoming and outgoing records), the transformation Lambda (invocations, errors, duration) and the Firehose delivery stream (delivery success and records delivered to S3). Firehose can also write delivery failures to CloudWatch Logs when error logging is enabled, which helps diagnose schema or conversion issues.
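As an illustration, the delivery metrics can also be pulled programmatically. The sketch below queries one Firehose metric with boto3; the stream name and region are placeholders.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

# Records Firehose delivered to S3 over the last hour, in 5-minute buckets.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Firehose",
    MetricName="DeliveryToS3.Records",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "es-to-s3-firehose"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```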

Performance/Throughput optimizations

Here are some parameters worth tweaking to improve overall throughput and efficiency. That is another article!

  • Logstash: pipeline workers, batch sizes, JVM memory.
  • Kinesis Streams: shard count.
  • Lambda: reducing the processing time is key; the maximum number of concurrent Lambda executions equals the shard count.
  • Firehose: buffer sizes and timeouts determine the file sizes on S3.

Instantiating the services took me roughly a day. Ready to migrate!
