Migrate an on-premises Apache Kafka cluster to Amazon MSK by using MirrorMaker
Created by Han Zhang (AWS) and Tanner Pratt (AWS)
Environment: PoC or pilot | Source: On-premises or self-managed Apache Kafka cluster | Target: Amazon Managed Streaming for Apache Kafka (Amazon MSK) |
R Type: Replatform | Workload: Open-source; All other workloads | Technologies: Analytics; Big data; Migration |
AWS services: Amazon MSK |
Summary
This pattern provides guidance for migrating an on-premises, self-managed, or hosted Apache Kafka cluster to Amazon Managed Streaming for Apache Kafka (Amazon MSK). You can also use this pattern to migrate from one Amazon MSK cluster to another.
Apache Kafka includes the MirrorMaker feature, which replicates data between two Kafka clusters. MirrorMaker consists of a collection of consumers, which are part of a consumer group. The consumers read data from the topics in the source cluster and then pass this data to producers, which write the data to the target cluster.
The Amazon MSK documentation contains a high-level overview of the process to use MirrorMaker version 1.0 to migrate on-premises Kafka clusters to Amazon MSK. This pattern supplements this information by offering comprehensive, step-by-step instructions for using MirrorMaker version 2.0.
Prerequisites and limitations
Prerequisites
An active AWS account
A Kafka source cluster that is one of the following:
In an on-premises data center
Self-managed in the cloud
Hosted through a partner
Limitations
To use MirrorMaker version 2.0, the source cluster must run Apache Kafka version 2.4.0 or later. For earlier versions, follow the instructions in the Amazon MSK documentation for using MirrorMaker version 1.0.
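The version limitation can be expressed as a small check. The following is a minimal sketch, not part of the original pattern; the function names are hypothetical, and it assumes the source version is available as a plain dotted string such as `2.4.0`.

```python
from typing import Tuple

# Minimum source cluster version for MirrorMaker 2.0, as stated in this pattern.
MM2_MINIMUM: Tuple[int, ...] = (2, 4, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Convert a dotted version string such as '2.4.0' into a tuple of ints.

    Short versions such as '2.4' are padded with zeros so that they compare
    as 2.4.0 rather than as something smaller.
    """
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # no-op when there are 3 or more parts
    return tuple(parts)


def mirrormaker_version_for(source_kafka_version: str) -> str:
    """Return the MirrorMaker version that the limitation above points to."""
    if parse_version(source_kafka_version) >= MM2_MINIMUM:
        return "2.0"
    return "1.0"


print(mirrormaker_version_for("2.8.1"))
print(mirrormaker_version_for("2.3.0"))
```

Version strings with non-numeric suffixes are not handled by this sketch and would need extra parsing.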
Product versions
MirrorMaker version 2.0
Apache Kafka version 2.4.0 or later. For more information about the versions of Apache Kafka that Amazon MSK supports, see Supported Apache Kafka versions.
Architecture
Source technology stack
On-premises or self-managed Kafka cluster
Target technology stack
Amazon MSK cluster
Target architecture
The diagram shows the following process:
MirrorMaker reads the data from the topics and consumer groups in the source Kafka cluster.
MirrorMaker replicates the data and consumer information to the target Amazon MSK cluster.
Tools
AWS services
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS Cloud. You can launch as many virtual servers as you need and quickly scale them up or down.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that helps you build and run applications that use Apache Kafka to process streaming data.
Amazon Virtual Private Cloud (Amazon VPC) helps you launch AWS resources into a virtual network that you’ve defined. This virtual network resembles a traditional network that you’d operate in your own data center, with the benefits of using the scalable infrastructure of AWS.
Other tools
Apache Kafka is an open-source event streaming platform. In this pattern, you use the MirrorMaker feature of Kafka to perform the cross-cluster migration.
Best practices
You can run MirrorMaker in either the source or the target environment, but it's recommended that you run it as close as possible to the target cluster. For more information, see Best Practice: Consume from Remote, Produce to Local.
Epics
Task | Description | Skills required |
---|---|---|
Create a VPC. | | AWS systems administrator, DevOps engineer, Cloud administrator |
Create the Amazon MSK cluster. | Create an Amazon MSK cluster. For instructions, see Creating a cluster using the AWS Management Console or Creating a cluster using the AWS CLI. Configure the cluster to use the VPC and subnets that you created previously. | AWS systems administrator, DevOps engineer, Cloud administrator |
Task | Description | Skills required |
---|---|---|
Install MirrorMaker. | Note: In this pattern, you install MirrorMaker 2.0 as a dedicated MirrorMaker cluster on an Amazon EC2 instance. This option is acceptable for development environments. For more information about other deployment options for MirrorMaker 2.0, see the Additional information section of this pattern. | AWS systems administrator, Cloud administrator, DevOps engineer |
Specify Kafka cluster information. | In the Kafka client installation | AWS systems administrator, Cloud administrator, DevOps engineer |
Start MirrorMaker. | Enter the following command to start MirrorMaker and pass the mm2.properties file. | AWS systems administrator, Cloud administrator, DevOps engineer |
Monitor the progress. | Check the progress by inspecting the lag between the last offset of each source topic and the current offset that MirrorMaker has consumed for that topic. For instructions, see Monitoring Geo-Replication. | AWS systems administrator, Cloud administrator, DevOps engineer |
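The tasks above refer to an mm2.properties file. As a rough illustration of what such a file can contain, the following sketch renders a minimal one-way replication configuration. The helper name and the broker addresses are hypothetical and do not come from this pattern. The property keys follow the sample MirrorMaker 2.0 configuration that ships with Apache Kafka. Security settings such as TLS or authentication are omitted and would need to be added for a real cluster.

```python
def render_mm2_properties(
    source_bootstrap: str,
    target_bootstrap: str,
    topics: str = ".*",
    replication_factor: int = 3,
) -> str:
    """Render a minimal mm2.properties body for one-way source -> target mirroring.

    'source' and 'target' are cluster aliases chosen for this sketch; any
    alias works as long as it is used consistently in every key.
    """
    lines = [
        "# Cluster aliases",
        "clusters = source, target",
        f"source.bootstrap.servers = {source_bootstrap}",
        f"target.bootstrap.servers = {target_bootstrap}",
        "",
        "# Replicate from source to target only",
        "source->target.enabled = true",
        f"source->target.topics = {topics}",
        "target->source.enabled = false",
        "",
        "# Replication factor for mirrored and internal topics",
        f"replication.factor = {replication_factor}",
        f"checkpoints.topic.replication.factor = {replication_factor}",
        f"heartbeats.topic.replication.factor = {replication_factor}",
        f"offset-syncs.topic.replication.factor = {replication_factor}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical broker addresses, for illustration only.
properties_text = render_mm2_properties(
    source_bootstrap="onprem-broker-1.example.internal:9092",
    target_bootstrap="msk-broker-1.example.internal:9092",
)
print(properties_text)
```

After the rendered text is saved as mm2.properties, a dedicated MirrorMaker 2.0 cluster is typically started with the `connect-mirror-maker.sh` script from the `bin` directory of the Kafka installation, with the properties file passed as its argument. The exact path depends on where Kafka is installed.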
Task | Description | Skills required |
---|---|---|
Stop the consumer applications. | Stop all consumer applications that consume data from the source cluster. | App developer |
Start the consumer applications. | Update the applications' bootstrap configuration to point to the target cluster. Then begin consuming from the target cluster. | App developer |
Stop the producers on the source cluster. | When the consumer applications are successfully consuming on the target cluster, stop the producers on the source cluster. | App developer |
Start the producers on the target cluster. | Wait for MirrorMaker to finish mirroring all data from the source cluster. Then update the producers' bootstrap servers configuration to point to the target cluster, and start the producers. | App developer |
Stop MirrorMaker. | After producers have moved to the target cluster, stop MirrorMaker. | AWS systems administrator, Cloud administrator, DevOps engineer |
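The cutover depends on MirrorMaker having mirrored all data before the producers start on the target cluster. The following sketch shows one way to reason about that condition: compute the lag per topic partition as the difference between the last offset in the source cluster and the offset that MirrorMaker has consumed, and treat the migration as caught up when every lag is zero. The function names and the sample offsets are hypothetical. How you collect the offsets (for example, from Kafka tooling or metrics) is outside the scope of this sketch.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def replication_lag(
    end_offsets: Dict[TopicPartition, int],
    consumed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return the per-partition lag: last source offset minus consumed offset.

    Partitions that MirrorMaker has not consumed yet count from offset 0.
    """
    return {
        tp: max(end - consumed_offsets.get(tp, 0), 0)
        for tp, end in end_offsets.items()
    }


def ready_to_start_producers(lag: Dict[TopicPartition, int]) -> bool:
    """True when MirrorMaker has caught up on every partition."""
    return all(value == 0 for value in lag.values())


# Sample offsets, invented for illustration only.
source_end = {("orders", 0): 1500, ("orders", 1): 1420, ("payments", 0): 300}
mirrored = {("orders", 0): 1500, ("orders", 1): 1400}

lag = replication_lag(source_end, mirrored)
print(lag)
print(ready_to_start_producers(lag))
```

Because the source producers are stopped before this check, the end offsets no longer advance, so a lag of zero on every partition indicates that no more data remains to be mirrored.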
Related resources
AWS resources
Migrating clusters using MirrorMaker (Amazon MSK documentation)
Amazon MSK migration labs (AWS Workshop Studio)
Other resources
MirrorMaker 2.0 (Apache Kafka Improvement Proposals)
Geo-Replication: Cross-Cluster Data Mirroring (Apache Kafka documentation)
Additional information
This pattern runs MirrorMaker 2.0 as a dedicated MirrorMaker cluster on Amazon EC2. This option is acceptable for development environments. Although it is not discussed in this pattern, you can also run MirrorMaker 2.0 in a Kafka Connect cluster. This deployment option uses a framework within the Kafka ecosystem that improves scaling and maintenance. You deploy the connector into a Kafka Connect cluster with the associated configuration to run the application. The connector can run in standalone mode for development or testing, or in distributed mode for production. For more information, see Running MirrorMaker in a Connect cluster.
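For the Kafka Connect deployment option, the connector configuration is usually submitted to the Connect cluster as a JSON document. The following sketch builds such a document for the MirrorSourceConnector. It is an illustration under stated assumptions, not a configuration from this pattern: the helper name, connector name, aliases, and broker addresses are hypothetical, and the configuration keys should be checked against the Apache Kafka documentation for your version.

```python
import json
from typing import Any, Dict


def mirror_source_connector_config(
    name: str,
    source_alias: str,
    target_alias: str,
    source_bootstrap: str,
    target_bootstrap: str,
    topics: str = ".*",
    tasks_max: int = 1,
) -> Dict[str, Any]:
    """Build a payload in the {"name": ..., "config": {...}} shape that the
    Kafka Connect REST API accepts when a connector is created."""
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
            "source.cluster.alias": source_alias,
            "target.cluster.alias": target_alias,
            "source.cluster.bootstrap.servers": source_bootstrap,
            "target.cluster.bootstrap.servers": target_bootstrap,
            "topics": topics,
            "tasks.max": str(tasks_max),
            # Pass record bytes through unchanged instead of re-serializing them.
            "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
            "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        },
    }


# Hypothetical names and addresses, for illustration only.
payload = mirror_source_connector_config(
    name="onprem-to-msk-source",
    source_alias="source",
    target_alias="target",
    source_bootstrap="onprem-broker-1.example.internal:9092",
    target_bootstrap="msk-broker-1.example.internal:9092",
    tasks_max=4,
)
print(json.dumps(payload, indent=2))
```

In distributed mode, a payload like this would be sent to the Connect cluster's REST endpoint for creating connectors. In standalone mode, the same keys would go into a connector properties file instead.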