AWS Prescriptive Guidance
Patterns

Migrate an on-premises Oracle database to Amazon ES using Logstash

R Type: Replatform

Source: Databases: Relational

Target: Amazon ES

Tags: search, oracle, amazon elasticsearch, logstash

Categories: Business Software, Independent Software Vendor, Software Infrastructure

Summary

This pattern describes how to move data from an on-premises Oracle 12c database to Amazon Elasticsearch Service (Amazon ES) 6.3 by using Logstash 6.4.3. It covers architectural considerations, the required skill sets, and recommendations. The data can come from a single table or from multiple tables on which full-text search must be performed.

Amazon ES can be configured within a virtual private cloud (VPC), or it can be publicly accessible with IP-based restrictions. This pattern describes the scenario where Amazon ES is configured within a VPC. Logstash collects the data from the Oracle database, parses it into JSON format, and then feeds the data into Amazon ES.

Assumptions and Prerequisites

Prerequisites

  • An active AWS account

  • Java 8 installed (required by Logstash 6.4.3)

  • Connectivity between the on-premises database servers and Amazon Elastic Compute Cloud (Amazon EC2) instances in a VPC, established by using AWS Virtual Private Network (AWS VPN)

  • A query that retrieves the data to be pushed from the database to Amazon ES

  • Oracle Java Database Connectivity (JDBC) drivers

Limitations

  • Logstash cannot identify records that are hard-deleted from the database, so the corresponding documents remain in the Amazon ES index unless they are removed separately

Architecture

Source technology stack

  • On-premises Oracle database

  • AWS VPN connection to the on-premises network

Target technology stack

  • VPC

  • EC2 instance

  • Amazon ES

  • Logstash

  • NAT gateway (for operating system updates on the EC2 instance and for installing Java 8, Logstash, and plugins)

Source and target architecture

Tools Used

Epics

Plan the migration

Tasks

Task: Identify the size of the source data.
Description: The size of the source data is one of the parameters used to determine the number of shards to configure on an index.
Skills: DBA, Database developer

Task: Analyze the data types of each column and the corresponding data.
Description: Amazon ES dynamically maps the data type when it encounters a previously unseen field in a document. If any fields require a specific data type or format (for example, date fields), identify them and define their mappings during index creation (see the index sketch after this task list).
Skills: App owner, Developer, Database developer

Task: Determine whether any columns have primary or unique keys.
Description: To avoid duplicate records in Amazon ES during updates or inserts, configure the document_id setting in the output section of the amazon_es plugin (for example, document_id => "%{customer_id}", where customer_id is a primary key; see the pipeline sketch after the Migrate data task list).
Skills: App owner, Developer

Task: Analyze how many new records are added and how frequently, and check how frequently records are deleted.
Description: This task establishes the growth rate of the source data. If the data is read-intensive and insertions are rare, a single index is sufficient. If new records are inserted frequently and there are no deletions, a shard can easily exceed the recommended maximum size of 50 GB. In that case, create indexes dynamically by configuring an index pattern in Logstash, and access them from your code through an alias.
Skills: App owner, Developer

Task: Determine how many replicas are required.
Skills: App owner, Developer

Task: Determine the number of shards to configure on the index.
Skills: App owner, Developer

Task: Identify the instance types for the dedicated master nodes, the data nodes, and the EC2 instance.
Description: For more information, see the References and Help section.
Skills: App owner, Developer

Task: Determine the number of dedicated master nodes and data nodes required.
Description: For more information, see the References and Help section.
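
The following request is an index sketch only, illustrating the mapping and sizing decisions above. It assumes a hypothetical customers index with fields customer_id, name, and created_date; in Elasticsearch 6.3, field mappings are declared under a single mapping type (shown here as _doc). Adjust the index name, shard and replica counts, field names, and date format to match your own analysis.

  PUT customers
  {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "mappings": {
      "_doc": {
        "properties": {
          "customer_id":  { "type": "keyword" },
          "name":         { "type": "text" },
          "created_date": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
        }
      }
    }
  }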

Migrate data

Tasks

Task: Launch an EC2 instance within the VPC to which AWS VPN is connected.
Skills: Amazon VPC constructs
Predecessor: AWS VPN

Task: Install Logstash on the EC2 instance.
Skills: Developer

Task: Install the required Logstash plugins: logstash-input-jdbc and logstash-output-amazon_es.
Skills: Developer

Task: Configure Logstash.
Description: Create the Logstash keystore to store sensitive information such as AWS Secrets Manager keys and database credentials, and then reference the keystore entries in the Logstash configuration file (see the initial-load pipeline sketch after this task list).
Skills: Developer

Task: Configure the dead letter queue and persistent queue.
Description: By default, when Logstash encounters an event that it cannot process because of a mapping error or another issue, the pipeline either hangs or drops the unsuccessful event. To protect against data loss in this situation, configure Logstash to write unsuccessful events to a dead letter queue instead of dropping them. To protect against data loss during abnormal termination, use the persistent queue feature, which stores the in-flight message queue on disk (see the logstash.yml sketch after this task list).
Skills: Developer

Task: Create the Amazon ES domain with an access policy that doesn't require signing requests with AWS Identity and Access Management (IAM) credentials.
Description: The Amazon ES domain must be created within the same VPC. Select the instance types, and set the number of dedicated master nodes and data nodes, based on your earlier analysis.
Skills: Developer

Task: Configure the required Amazon ES logs.
Description: For more information, see https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/cloudwatch-alarms.html.

Task: Create the index.
Description: See the index sketch after the Plan the migration task list.
Skills: Developer

Task: Start Logstash.
Description: Run Logstash as a background service. Logstash runs the configured SQL query, pulls the data, converts it to JSON format, and feeds it to Amazon ES. For the initial load, do not configure the scheduler in the Logstash configuration file (see the initial-load pipeline sketch after this task list).
Skills: Developer

Task: Verify the initial load, and then enable incremental loads.
Description: Check that the number of documents in the index matches the number of records in the source database. After the initial load completes and all documents have been indexed, stop Logstash. Then change the Logstash configuration to add a scheduler that runs at a fixed interval based on client requirements, and restart Logstash. Logstash picks up only the records that were updated or added after the last run; the last run timestamp is stored in the file configured with the property last_run_metadata_path => "/usr/share/logstash/.logstash_jdbc_last_run" in the Logstash configuration file (see the incremental-load sketch after this task list).
Skills: Developer
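
As a minimal logstash.yml sketch of the queue settings described above (the path and size values are illustrative assumptions, not recommendations):

  # logstash.yml -- durability settings (illustrative values)
  queue.type: persisted                  # persistent queue: spool in-flight events to disk
  queue.max_bytes: 1gb                   # upper bound for the on-disk queue
  dead_letter_queue.enable: true         # keep unprocessable events instead of dropping them
  path.dead_letter_queue: /usr/share/logstash/dlq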
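
The following initial-load pipeline sketch ties the configuration tasks together. It assumes that the plugins were installed with bin/logstash-plugin install, and that keystore entries named ORACLE_USER and ORACLE_PWD were added with bin/logstash-keystore add. The connection string, the customers table, the customer_id primary key, the Amazon ES VPC endpoint, and the index pattern are all hypothetical placeholders.

  input {
    jdbc {
      # ${...} values are resolved from the Logstash keystore at startup.
      jdbc_connection_string => "jdbc:oracle:thin:@//onprem-db-host:1521/ORCLPDB1"
      jdbc_user              => "${ORACLE_USER}"
      jdbc_password          => "${ORACLE_PWD}"
      jdbc_driver_library    => "/usr/share/logstash/drivers/ojdbc8.jar"
      jdbc_driver_class      => "Java::oracle.jdbc.driver.OracleDriver"
      # No schedule setting: the statement runs once for the initial load.
      statement => "SELECT customer_id, name, created_date FROM customers"
    }
  }

  output {
    amazon_es {
      hosts         => ["vpc-mydomain-abc123.us-east-1.es.amazonaws.com"]
      region        => "us-east-1"
      index         => "customers-%{+YYYY.MM}"   # index pattern; query it through an alias
      document_type => "_doc"
      document_id   => "%{customer_id}"          # primary key prevents duplicate documents
    }
  }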
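
After the initial load is verified, the jdbc input can be changed to run on a schedule and pull only changed rows. This incremental-load sketch assumes a hypothetical updated_date audit column on the source table; the plugin substitutes :sql_last_value from the timestamp persisted at last_run_metadata_path.

  input {
    jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@//onprem-db-host:1521/ORCLPDB1"
      jdbc_user              => "${ORACLE_USER}"
      jdbc_password          => "${ORACLE_PWD}"
      jdbc_driver_library    => "/usr/share/logstash/drivers/ojdbc8.jar"
      jdbc_driver_class      => "Java::oracle.jdbc.driver.OracleDriver"
      schedule               => "*/5 * * * *"   # cron syntax: run every 5 minutes
      statement              => "SELECT customer_id, name, created_date, updated_date FROM customers WHERE updated_date > :sql_last_value"
      use_column_value       => true
      tracking_column        => "updated_date"
      tracking_column_type   => "timestamp"
      last_run_metadata_path => "/usr/share/logstash/.logstash_jdbc_last_run"
    }
  }

  output {
    amazon_es {
      hosts         => ["vpc-mydomain-abc123.us-east-1.es.amazonaws.com"]
      region        => "us-east-1"
      index         => "customers-%{+YYYY.MM}"
      document_type => "_doc"
      document_id   => "%{customer_id}"   # updates overwrite the existing documents
    }
  }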

References and Help

References

Contact and help

Migration Pattern Library Support: aws-mpl@amazon.com