
Step 4: Prepare the source data and the target table in Amazon Keyspaces


In this step, you create a source file with sample data and an Amazon Keyspaces table.

  1. Create the source file. You can choose one of the following options:

    • For this tutorial, you use a comma-separated values (CSV) file with the name keyspaces_sample_table.csv as the source file for the data migration. The provided sample file contains a few rows of data for a table with the name book_awards.

      1. Download the sample CSV file (keyspaces_sample_table.csv) contained in the archive file samplemigration.zip. Unzip the archive and take note of the path to keyspaces_sample_table.csv.

    • If you want to follow along with your own CSV file to write data to Amazon Keyspaces, make sure that the data is randomized. Data that is read directly from a database or exported to flat files is typically ordered by the partition and primary key. Importing ordered data to Amazon Keyspaces can cause it to be written to smaller segments of Amazon Keyspaces partitions, which results in an uneven traffic distribution. This can lead to slower performance and higher error rates.

      In contrast, randomizing data helps to take advantage of the built-in load balancing capabilities of Amazon Keyspaces by distributing traffic across partitions more evenly. There are various tools that you can use to randomize data. For an example that uses the open-source tool shuf, see Step 2: Prepare the data to upload using DSBulk in the data migration tutorial. The following example shows how to shuffle data in a Spark DataFrame.

      import org.apache.spark.sql.functions.rand

      val shuffledDF = dataframe.orderBy(rand())
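
      If your source data lives in a standalone CSV file rather than a Spark DataFrame, you can randomize the rows before writing to Amazon Keyspaces. The following is a minimal Python sketch (not part of the tutorial's tooling; the function name shuffle_csv_rows and the optional seed parameter are illustrative) that shuffles the data rows while keeping the header line first:

```python
import csv
import random


def shuffle_csv_rows(in_path, out_path, seed=None):
    """Shuffle the data rows of a CSV file, keeping the header line first."""
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # keep the header row in place
        rows = list(reader)

    # Randomize the row order; a fixed seed makes the shuffle reproducible.
    random.Random(seed).shuffle(rows)

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Example: shuffle the sample file before running the migration
# shuffle_csv_rows("keyspaces_sample_table.csv",
#                  "keyspaces_sample_table_shuffled.csv")
```

      Writing the shuffled rows to a new file leaves the original source file intact, which makes it easy to re-run the shuffle or compare the two files.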
  2. Create the target keyspace and table in Amazon Keyspaces.

    1. Connect to Amazon Keyspaces using cqlsh, and replace the service endpoint, user name, and password in the following example with your own values.

      cqlsh cassandra.us-east-2.amazonaws.com 9142 -u "111122223333" -p "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" --ssl
    2. Create a new keyspace with the name catalog as shown in the following example.

      CREATE KEYSPACE catalog WITH REPLICATION = {'class': 'SingleRegionStrategy'};
    3. After the new keyspace has a status of available, use the following code to create the target table book_awards. To learn more about asynchronous resource creation and how to check if a resource is available, see Check keyspace creation status in Amazon Keyspaces.

      CREATE TABLE catalog.book_awards (
          year int,
          award text,
          rank int,
          category text,
          book_title text,
          author text,
          publisher text,
          PRIMARY KEY ((year, award), category, rank)
      );
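
      Because keyspace creation in Amazon Keyspaces is asynchronous, you can confirm from cqlsh that the keyspace exists before creating the table by querying the Amazon Keyspaces system schema. The following query is a sketch of that check; system_schema_mcs is the Amazon Keyspaces-specific system schema, and the keyspace is available once the query returns a row:

      SELECT * FROM system_schema_mcs.keyspaces WHERE keyspace_name = 'catalog';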
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.