Migrate an on-premises Oracle database to Amazon OpenSearch Service using Logstash
Created by Aditya Goteti (AWS)
Environment: PoC or pilot | Source: Oracle Database | Target: Amazon OpenSearch Service |
Type: Replatform | Workload: Oracle | Technologies: Migration; Databases |
AWS services: Amazon OpenSearch Service |
Summary
This pattern describes how to move data from an on-premises Oracle database to Amazon OpenSearch Service by using Logstash. It includes architectural considerations, the required skill sets, and recommendations. The data can come from a single table or from multiple tables on which a full-text search needs to be performed.
OpenSearch Service can be configured within a virtual private cloud (VPC), or it can be publicly accessible with IP-based restrictions. This pattern describes a scenario where OpenSearch Service is configured within a VPC. Logstash collects the data from the Oracle database, parses it into JSON format, and then feeds the data into OpenSearch Service.
Prerequisites and limitations
Prerequisites
An active AWS account
Java 8 (required by Logstash 6.4.3)
Connectivity between the on-premises database servers and Amazon Elastic Compute Cloud (Amazon EC2) instances in a VPC, established using AWS Virtual Private Network (AWS VPN)
A query to retrieve the required data to be pushed to OpenSearch Service from the database
Oracle Java Database Connectivity (JDBC) drivers
Limitations
Logstash cannot identify records that are hard-deleted from the database
Product versions
Oracle Database 12c
OpenSearch Service 6.3
Logstash 6.4.3
Architecture
Source technology stack
On-premises Oracle database
AWS VPN connection from the on-premises network
Target technology stack
VPC
EC2 instance
OpenSearch Service
Logstash
NAT Gateway (for operating system updates on EC2 instances and to install Java 8, Logstash, and plugins)
Data migration architecture
Tools
Logstash 6.4.3
JDBC input plugin (download and more information)
Logstash output plugin (logstash-output-amazon_es)
Oracle JDBC drivers
Epics
Task | Description | Skills required |
---|---|---|
Identify the size of the source data. | The size of the source data is one of the parameters that you use to determine the number of shards to be configured in an index. | DBA, Database developer |
Analyze the data types of each column and corresponding data. | OpenSearch Service dynamically maps the data type when a previously unseen field is found in a document. If there are specific data types or formats (for example, date fields) that need to be explicitly declared, identify those fields and define their mapping during index creation (see the mapping example after this table). | App owner, Developer, Database developer |
Determine if there are any columns with primary or unique keys. | To avoid duplication of records in Amazon OpenSearch Service during updates or inserts, you need to configure the document_id setting in the Logstash output so that it reuses the table's primary or unique key (see the example after this table). | App owner, Developer |
Analyze the number and frequency of new records added; check how frequently records are deleted. | This task is required to understand the growth rate of the source data. If data is read intensively and insertions are rare, you can use a single index. If new records are inserted frequently and there are no deletions, a shard can easily exceed the recommended maximum size of 50 GB. In that case, you can create indexes dynamically by configuring index patterns in Logstash, and access them in your code through an alias (see the example after this table). | App owner, Developer |
Determine how many replicas are required. | | App owner, Developer |
Determine the number of shards to be configured on the index. | | App owner, Developer |
Identify the instance types for dedicated master nodes, data nodes, and the EC2 instance. | For more information, see the Related resources section. | App owner, Developer |
Determine the number of dedicated master nodes and data nodes required. | For more information, see the Related resources section. | App owner, Developer |
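The following request sketches the explicit mapping step described in the table above. The domain endpoint, index name (orders), and field names are hypothetical; adjust them to your schema. In OpenSearch Service 6.3 (Elasticsearch 6.x API), mappings are declared under a single type such as _doc, and the shard and replica counts are fixed at index creation:
```
curl -XPUT "https://vpc-my-domain-abc123.us-east-1.es.amazonaws.com/orders" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "order_id":     { "type": "keyword" },
        "description":  { "type": "text" },
        "created_date": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
      }
    }
  }
}'
```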
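To prevent duplicate records, map the table's primary key to the document_id setting of the logstash-output-amazon_es plugin. This is a minimal sketch; the endpoint and the order_id field are placeholders:
```
output {
  amazon_es {
    hosts       => ["vpc-my-domain-abc123.us-east-1.es.amazonaws.com"]
    region      => "us-east-1"
    index       => "orders"
    # Reuse the primary key so an updated row overwrites the existing
    # document instead of creating a duplicate.
    document_id => "%{order_id}"
  }
}
```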
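For fast-growing tables, a date-based index pattern in the Logstash output keeps individual shards under the 50 GB guideline, and an alias gives application code a single stable name. The index and alias names here are illustrative:
```
output {
  amazon_es {
    hosts  => ["vpc-my-domain-abc123.us-east-1.es.amazonaws.com"]
    region => "us-east-1"
    # One index per month; Logstash substitutes the event timestamp.
    index  => "orders-%{+YYYY.MM}"
  }
}
```
You can then group the monthly indexes under one alias so that queries don't need to know the pattern:
```
curl -XPOST "https://vpc-my-domain-abc123.us-east-1.es.amazonaws.com/_aliases" \
  -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "orders-*", "alias": "orders-all" } }
  ]
}'
```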
Task | Description | Skills required |
---|---|---|
Launch an EC2 instance. | Launch an EC2 instance within the VPC to which AWS VPN is connected. | Amazon VPC constructs, AWS VPN |
Install Logstash on the EC2 instance. | Install Logstash 6.4.3, which requires Java 8. | Developer |
Install the Logstash plugins. | Install the JDBC input plugin and the logstash-output-amazon_es output plugin (see the commands after this table). | Developer |
Configure Logstash. | Create the Logstash keystore to store sensitive information such as AWS Secrets Manager keys and database credentials, and then place the references in a Logstash configuration file (see the commands after this table). | Developer |
Configure the dead letter queue and persistent queue. | By default, when Logstash encounters an event that it cannot process because of a mapping error or another issue, the pipeline either hangs or drops the unsuccessful event. To protect against data loss in this situation, configure Logstash to write unsuccessful events to a dead letter queue instead of dropping them. To protect against data loss during abnormal termination, enable the persistent queue feature, which stores the in-flight message queue on disk. Persistent queues provide data durability in Logstash (see the settings after this table). | Developer |
Create the Amazon OpenSearch Service domain. | Create the Amazon OpenSearch Service domain with an access policy that doesn't require signing requests with AWS Identity and Access Management (IAM) credentials. The domain must be created within the same VPC. You should also select the instance types and set the number of dedicated master nodes and data nodes based on your analysis (see the example after this table). | Developer |
Configure the required Amazon OpenSearch Service logs. | For more information, see the OpenSearch Service documentation. | Developer |
Create the index. | Create the index with the field mappings, shard count, and replica count that you determined earlier (see the mapping example after the first table). | Developer |
Start Logstash. | Run Logstash as a background service. Logstash runs the configured SQL query, pulls the data, converts it to JSON format, and feeds it into OpenSearch Service. For the initial load, do not configure the scheduler in the Logstash configuration file (see the pipeline example after this table). | Developer |
Check documents. | Check the number of documents in the index against the number of records in the source database. After the initial load completes and all documents are present, stop Logstash. Then change the Logstash configuration to add a scheduler that runs at a fixed interval, depending on client requirements, and restart Logstash. Logstash picks up only the records that were updated or added after the last run; the last-run timestamp is stored in the file configured with the last_run_metadata_path property (see the incremental load example after this table). | Developer |
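Example plugin installation commands, run from the Logstash installation directory (paths assume a default Logstash 6.4.3 install):
```
# The JDBC input plugin is bundled with Logstash 6.x; installing it
# explicitly simply confirms that it is present and up to date.
bin/logstash-plugin install logstash-input-jdbc

# Output plugin that ships events to an Amazon OpenSearch Service domain.
bin/logstash-plugin install logstash-output-amazon_es
```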
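A minimal keystore setup, assuming Logstash 6.4.3 defaults; the key name JDBC_PASSWORD is arbitrary:
```
# Create the keystore once.
bin/logstash-keystore create

# Add the database password; Logstash prompts for the value, so it
# never appears in shell history or in the configuration file.
bin/logstash-keystore add JDBC_PASSWORD
```
Pipeline files then reference the secret as ${JDBC_PASSWORD}, as shown in the pipeline example below.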
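The dead letter queue and persistent queue settings live in logstash.yml; a sketch with placeholder paths:
```
# Write events that fail with mapping (or similar) errors to disk
# instead of dropping them.
dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue

# Buffer the in-flight queue on disk so events survive a crash
# or abnormal termination.
queue.type: persisted
queue.max_bytes: 1gb
path.queue: /var/lib/logstash/queue
```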
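One way to create the domain is with the AWS CLI (at version 6.3 the service API is still named es/Elasticsearch). The domain name, instance types, node counts, and network IDs below are placeholders to be filled in from the sizing analysis in the first epic:
```
aws es create-elasticsearch-domain \
  --domain-name oracle-search \
  --elasticsearch-version 6.3 \
  --elasticsearch-cluster-config InstanceType=m4.large.elasticsearch,InstanceCount=2,DedicatedMasterEnabled=true,DedicatedMasterType=m4.large.elasticsearch,DedicatedMasterCount=3 \
  --ebs-options EBSEnabled=true,VolumeType=gp2,VolumeSize=100 \
  --vpc-options SubnetIds=subnet-0abc1234,SecurityGroupIds=sg-0abc1234
```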
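A sketch of an initial-load pipeline that ties the pieces together. The hostnames, the orders table, and the column names are hypothetical; note that no schedule is set, so the query runs once:
```
input {
  jdbc {
    jdbc_driver_library    => "/opt/logstash/drivers/ojdbc8.jar"
    jdbc_driver_class      => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@//onprem-db.example.com:1521/ORCL"
    jdbc_user              => "logstash_user"
    jdbc_password          => "${JDBC_PASSWORD}"  # resolved from the keystore
    statement              => "SELECT order_id, description, created_date FROM orders"
  }
}

output {
  amazon_es {
    hosts       => ["vpc-my-domain-abc123.us-east-1.es.amazonaws.com"]
    region      => "us-east-1"
    index       => "orders"
    document_id => "%{order_id}"
  }
}
```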
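After the initial load, switch the input to an incremental query. The schedule uses cron syntax, :sql_last_value is substituted by the JDBC input plugin, and last_run_metadata_path is the property that stores the last-run timestamp mentioned above (the path and the last_modified column are placeholders):
```
input {
  jdbc {
    # ...same driver and connection settings as the initial load...
    schedule               => "*/5 * * * *"  # every 5 minutes
    statement              => "SELECT order_id, description, created_date
                                 FROM orders
                                WHERE last_modified > :sql_last_value"
    use_column_value       => true
    tracking_column        => "last_modified"
    tracking_column_type   => "timestamp"
    last_run_metadata_path => "/var/lib/logstash/.orders_jdbc_last_run"
  }
}
```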