Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue - AWS Prescriptive Guidance

Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue

Created by Nikolai Kolesnikov (AWS), Karthiga Priya Chandran (AWS), and Samir Patel (AWS)

Environment: Production

Source: Cassandra

Target: Amazon Keyspaces

R Type: N/A

Workload: Open-source; All other workloads

Technologies: Analytics; Migration; Serverless; Big data

AWS services: AWS Glue; Amazon Keyspaces; Amazon S3; AWS CloudShell

Summary

This pattern shows you how to migrate your existing Apache Cassandra workloads to Amazon Keyspaces (for Apache Cassandra) by using CQLReplicator on AWS Glue. By running CQLReplicator on AWS Glue, you can keep the replication lag during the migration down to a matter of minutes. You also learn how to use an Amazon Simple Storage Service (Amazon S3) bucket to store the data required for the migration, including Apache Parquet files, configuration files, and scripts. This pattern assumes that your Cassandra workloads are hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances in a virtual private cloud (VPC).

Prerequisites and limitations

Prerequisites

  • Cassandra cluster with a source table

  • Target table in Amazon Keyspaces to replicate the workload

  • S3 bucket to store intermediate Parquet files that contain incremental data changes

  • S3 bucket to store job configuration files and scripts
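If you don't already have these S3 buckets, the following AWS CLI commands are a minimal sketch for creating them; the bucket names and Region are example values, so substitute your own. Alternatively, the CQLReplicator init process described later in this pattern can create a bucket for you if you don't supply the --landing-zone parameter.

# Example bucket names and Region only; S3 bucket names must be globally unique
aws s3 mb s3://cql-replicator-intermediate-data-example --region us-west-2
aws s3 mb s3://cql-replicator-config-scripts-example --region us-west-2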

Limitations

  • CQLReplicator on AWS Glue requires some time to provision Data Processing Units (DPUs) for the Cassandra workloads, so expect a replication lag of a few minutes between the Cassandra cluster and the target keyspace and table in Amazon Keyspaces.

Architecture

Source technology stack  

  • Apache Cassandra

  • DataStax Server

  • ScyllaDB

Target technology stack  

  • Amazon Keyspaces

Migration architecture  

The following diagram shows an example architecture where a Cassandra cluster is hosted on EC2 instances and spread across three Availability Zones. The Cassandra nodes are hosted in private subnets.

Architecture diagram: AWS Glue connects to the VPC that hosts the Cassandra nodes and uses a custom service role to access Amazon Keyspaces and Amazon S3.

The diagram shows the following workflow:

  1. A custom service role provides access to Amazon Keyspaces and the S3 bucket.

  2. An AWS Glue job reads the job configuration and scripts in the S3 bucket.

  3. The AWS Glue job connects through port 9042 to read data from the Cassandra cluster.

  4. The AWS Glue job connects through port 9142 to write data to Amazon Keyspaces.

Tools

AWS services and tools

  • AWS Command Line Interface (AWS CLI) is an open-source tool that helps you interact with AWS services through commands in your command-line shell.

  • AWS CloudShell is a browser-based shell that you can use to manage AWS services by using the AWS Command Line Interface (AWS CLI) and a range of preinstalled development tools.

  • AWS Glue is a fully managed ETL service that helps you reliably categorize, clean, enrich, and move data between data stores and data streams.

  • Amazon Keyspaces (for Apache Cassandra) is a managed database service that helps you migrate, run, and scale your Cassandra workloads in the AWS Cloud.

Code

The code for this pattern is available in the GitHub CQLReplicator repository.

Best practices

  • To determine the necessary AWS Glue resources for the migration, estimate the number of rows in the source Cassandra table. For example, plan for about 250 K rows per 0.25 DPU (2 vCPUs, 4 GB of memory, and 84 GB of disk).

  • Pre-warm Amazon Keyspaces tables before running CQLReplicator. For example, eight CQLReplicator tiles (AWS Glue jobs) can write up to 22 K WCUs per second, so the target should be pre-warmed up to 25-30 K WCUs per second.

  • To enable communication between AWS Glue components, use a self-referencing inbound rule for all TCP ports in your security group (see the example command after this list).

  • Use the incremental traffic strategy to distribute the migration workload over time.
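For the self-referencing security group rule described earlier in this list, the following AWS CLI command is a minimal sketch; the security group ID is a placeholder, and the same group is used as both the rule's target and source so that AWS Glue components in that group can communicate on all TCP ports.

# Allow all TCP traffic between members of the same security group (placeholder group ID)
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 0-65535 \
    --source-group sg-0123456789abcdef0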

Epics

Task | Description | Skills required

Create a target keyspace and table.

  1. Create a keyspace and table in Amazon Keyspaces.

    For more information on write capacity, see Write unit calculations in the Additional information section of this pattern.

    You can also create a keyspace by using the Cassandra Query Language (CQL). For more information, see Create a keyspace by using CQL in the Additional information section of this pattern.

    Note: After you create the table, consider switching the table to on-demand capacity mode to avoid unnecessary charges.

  2. To switch the table to on-demand capacity mode, run the following statement:

    ALTER TABLE target_keyspace.target_table WITH CUSTOM_PROPERTIES = { 'capacity_mode':{ 'throughput_mode':'PAY_PER_REQUEST'} }
App owner, AWS administrator, DBA, App developer

Configure the Cassandra driver to connect to Cassandra.

Use the following configuration script:

datastax-java-driver {
    basic.request.consistency = "LOCAL_QUORUM"
    basic.contact-points = ["127.0.0.1:9042"]
    advanced.reconnect-on-init = true
    basic.load-balancing-policy {
        local-datacenter = "datacenter1"
    }
    advanced.auth-provider = {
        class = PlainTextAuthProvider
        username = "user-at-sample"
        password = "S@MPLE=PASSWORD="
    }
}

Note: The preceding script uses the Spark Cassandra Connector. For more information, see the reference configuration for Cassandra.

DBA

Configure the Cassandra driver to connect to Amazon Keyspaces.

Use the following configuration script:

datastax-java-driver {
    basic {
        load-balancing-policy {
            local-datacenter = us-west-2
        }
        contact-points = ["cassandra.us-west-2.amazonaws.com:9142"]
        request {
            page-size = 2500
            timeout = 360 seconds
            consistency = LOCAL_QUORUM
        }
    }
    advanced {
        control-connection {
            timeout = 360 seconds
        }
        session-leak.threshold = 6
        connection {
            connect-timeout = 360 seconds
            init-query-timeout = 360 seconds
            warn-on-init-error = false
        }
        auth-provider = {
            class = software.aws.mcs.auth.SigV4AuthProvider
            aws-region = us-west-2
        }
        ssl-engine-factory {
            class = DefaultSslEngineFactory
        }
    }
}

Note: The preceding script uses the Spark Cassandra Connector. For more information, see the reference configuration for Cassandra.

DBA

Create an IAM role for the AWS Glue job.

Create a new AWS service role named glue-cassandra-migration with AWS Glue as a trusted entity.

Note: The glue-cassandra-migration role must provide read and write access to the S3 bucket and Amazon Keyspaces. The S3 bucket contains the .jar files, the configuration files for Amazon Keyspaces and Cassandra, and the intermediate Parquet files. For example, you can attach the AWSGlueServiceRole, AmazonS3FullAccess, and AmazonKeyspacesFullAccess managed policies to the role.
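The following AWS CLI sketch shows one way to create and configure this role with the managed policies mentioned in the note; in production, you might prefer tighter, least-privilege policies instead of the full-access versions.

# Trust policy that lets AWS Glue assume the role
cat > glue-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name glue-cassandra-migration \
    --assume-role-policy-document file://glue-trust-policy.json

# Attach the managed policies named in the note above
aws iam attach-role-policy --role-name glue-cassandra-migration \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam attach-role-policy --role-name glue-cassandra-migration \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name glue-cassandra-migration \
    --policy-arn arn:aws:iam::aws:policy/AmazonKeyspacesFullAccess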

AWS DevOps

Download CQLReplicator in AWS CloudShell.

Download the project to your home folder by running the following command:

git clone https://github.com/aws-samples/cql-replicator.git
cd cql-replicator/glue

# Only for AWS CloudShell. The bc package includes bc and dc; bc is an arbitrary precision numeric processing arithmetic language.
sudo yum install bc -y

Modify the reference configuration files.

Copy CassandraConnector.conf and KeyspacesConnector.conf to the ../glue/conf directory in the project folder.

AWS DevOps

Initiate the migration process.

The following command initializes the CQLReplicator environment. Initialization involves copying .jar artifacts and creating an AWS Glue connector, an S3 bucket, an AWS Glue job, the migration keyspace, and the ledger table:

cd cql-replicator/glue/bin
./cqlreplicator --state init --sg '"sg-1","sg-2"' \
    --subnet "subnet-XXXXXXXXXXXX" \
    --az us-west-2a --region us-west-2 \
    --glue-iam-role glue-cassandra-migration \
    --landing-zone s3://cql-replicator-1234567890-us-west-2

The script includes the following parameters:

  • --sg – The security groups that allow access to the Cassandra cluster from AWS Glue and include the self-referencing inbound rule for all traffic

  • --subnet – The subnet to which the Cassandra cluster belongs

  • --az – The Availability Zone of the subnet (you can look it up with the command shown after this list)

  • --region – The AWS Region where the Cassandra cluster is deployed

  • --glue-iam-role – The IAM role that AWS Glue assumes when calling Amazon Keyspaces and Amazon S3 on your behalf

  • --landing-zone – An optional parameter for reusing an existing S3 bucket (If you don't supply a value for the --landing-zone parameter, the init process tries to create a new bucket to store the configuration files, .jar artifacts, and intermediate files.)
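If you're unsure which Availability Zone to pass to --az, the following sketch looks it up from the subnet ID; the subnet ID below is the same placeholder used in the init command.

# Return the Availability Zone of the subnet that hosts the Cassandra cluster
aws ec2 describe-subnets \
    --subnet-ids subnet-XXXXXXXXXXXX \
    --query 'Subnets[0].AvailabilityZone' \
    --output text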

AWS DevOps

Validate the deployment.

After you run the previous command, the AWS account should contain the following:

  • The CQLReplicator AWS Glue job and the AWS Glue connector in AWS Glue

  • The S3 bucket that stores the artifacts

  • The migration keyspace and the ledger table in Amazon Keyspaces
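You can spot-check these resources from AWS CloudShell with commands such as the following sketch; it assumes the landing-zone bucket name from the init command and that the init process created a keyspace named migration for the ledger table.

# List AWS Glue jobs and confirm that the CQLReplicator job exists
aws glue list-jobs --region us-west-2

# Confirm that the landing-zone bucket contains the artifacts
aws s3 ls s3://cql-replicator-1234567890-us-west-2 --recursive

# Confirm that the ledger table exists in the migration keyspace
aws keyspaces list-tables --keyspace-name migration --region us-west-2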

AWS DevOps
Task | Description | Skills required

Start the migration process.

To operate CQLReplicator on AWS Glue, use the --state run command, followed by a series of parameters. The precise configuration of these parameters is primarily determined by your migration requirements. For example, these settings might vary if you choose to replicate time to live (TTL) values and updates, or to offload objects exceeding 1 MB to Amazon S3.

To replicate the workload from the Cassandra cluster to Amazon Keyspaces, run the following command:

./cqlreplicator --state run --tiles 8 \
    --landing-zone s3://cql-replicator-1234567890-us-west-2 \
    --region us-west-2 \
    --src-keyspace source_keyspace \
    --src-table source_table \
    --trg-keyspace target_keyspace \
    --writetime-column column_name \
    --trg-table target_table \
    --inc-traffic

Your source keyspace and table are source_keyspace.source_table in the Cassandra cluster. Your target keyspace and table are target_keyspace.target_table in Amazon Keyspaces. The --inc-traffic parameter increases traffic incrementally to help prevent overloading the Cassandra cluster and Amazon Keyspaces with a high number of requests.

To replicate updates, add --writetime-column regular_column_name to your command line. This regular column is used as the source of the write timestamp.

AWS DevOps
Task | Description | Skills required

Validate migrated Cassandra rows during the historical migration phase.

To obtain the number of rows replicated during the backfilling phase, run the following command:

./cqlreplicator --state stats \
    --landing-zone s3://cql-replicator-1234567890-us-west-2 \
    --src-keyspace source_keyspace --src-table source_table \
    --region us-west-2
AWS DevOps
Task | Description | Skills required

Stop the migration process by using the cqlreplicator command or the AWS Glue console.

To stop the migration process gracefully, run the following command:

./cqlreplicator --state request-stop --tiles 8 \
    --landing-zone s3://cql-replicator-1234567890-us-west-2 \
    --region us-west-2 \
    --src-keyspace source_keyspace --src-table source_table

To stop the migration process immediately, use the AWS Glue console.
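If you prefer the AWS CLI to the console for an immediate stop, the following sketch finds and stops running job runs; the job name CQLReplicator is an assumption, so confirm the actual job name in the AWS Glue console first.

# List the IDs of running job runs (the job name is an assumption; verify it first)
aws glue get-job-runs --job-name CQLReplicator --region us-west-2 \
    --query 'JobRuns[?JobRunState==`RUNNING`].Id'

# Stop a specific job run by its ID
aws glue batch-stop-job-run --job-name CQLReplicator --region us-west-2 \
    --job-run-ids jr_exampleid1234567890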

AWS DevOps
Task | Description | Skills required

Delete the deployed resources.

The following command deletes the AWS Glue job, the AWS Glue connector, the S3 bucket, and the ledger table in Amazon Keyspaces:

./cqlreplicator --state cleanup --landing-zone s3://cql-replicator-1234567890-us-west-2
AWS DevOps

Troubleshooting

IssueSolution

AWS Glue jobs failed and returned an Out of Memory (OOM) error.

  1. Change the worker type (scale up). For example, change G.025X to G.1X, or G.1X to G.2X. Alternatively, increase the number of DPUs per AWS Glue job (scale out) in CQLReplicator.

  2. Start the migration process from the point where it was interrupted. To restart failed CQLReplicator jobs, rerun the --state run command with the same parameters.

Related resources

Additional information

Migration considerations

You can use AWS Glue to migrate your Cassandra workload to Amazon Keyspaces while keeping your Cassandra source databases fully functional during the migration process. After the replication is complete, you can choose to cut over your applications to Amazon Keyspaces with minimal replication lag (a matter of minutes or less) between the Cassandra cluster and Amazon Keyspaces. To maintain data consistency, you can also use a similar pipeline to replicate the data back to the Cassandra cluster from Amazon Keyspaces.

Write unit calculations

As an example, consider that you intend to write 500,000,000 rows, each about 1 KiB in size, over a two-hour migration window. The total number of Amazon Keyspaces write capacity units (WCUs) that you require is based on this calculation:

WCUs required = (number of rows / migration window in seconds) * 1 WCU per row
              = 500,000,000 / (2 * 60 * 60 s) * 1 WCU
              ≈ 69,444 WCUs

69,444 WCUs per second is the base rate for the migration window, but you could add some cushion for overhead. For example, 69,444 * 1.10 = 76,388 WCUs allows for 10 percent overhead.
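Because the bc package is installed in AWS CloudShell earlier in this pattern, you can reproduce this arithmetic directly in the shell; the numbers simply restate the example above.

# WCUs required for 500,000,000 rows written over two hours (7,200 seconds)
echo "500000000 / (2 * 60 * 60)" | bc

# Add a 10 percent cushion for overhead
echo "69444 * 1.10" | bc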

Create a keyspace by using CQL

To create a keyspace by using CQL, run the following commands:

CREATE KEYSPACE target_keyspace WITH replication = {'class': 'SingleRegionStrategy'}

CREATE TABLE target_keyspace.target_table (
    userid uuid,
    level text,
    gameid int,
    description text,
    nickname text,
    zip text,
    email text,
    updatetime text,
    PRIMARY KEY (userid, level, gameid)
) WITH default_time_to_live = 0
    AND CUSTOM_PROPERTIES = {
        'capacity_mode': {
            'throughput_mode': 'PROVISIONED',
            'write_capacity_units': 76388,
            'read_capacity_units': 3612
        }
    }
    AND CLUSTERING ORDER BY (level ASC, gameid ASC)