Use write sharding to evenly distribute workloads across partitions

One way to distribute writes more evenly across partitions in Amazon Keyspaces is to expand the partition key space. You can do this in several ways. You can add an additional partition key column to which you write random numbers to spread the rows among partitions. Or you can use a number that is calculated based on something that you're querying on.

Sharding using compound partition keys and random values

One strategy for distributing loads more evenly across a partition is to add an additional partition key column to which you write random numbers. Then you randomize the writes across the larger space.

For example, consider the following table, which has a single partition key representing a date.

CREATE TABLE IF NOT EXISTS tracker.blogs (
    publish_date date,
    title text,
    description text,
    PRIMARY KEY (publish_date, title));

To distribute this table more evenly across partitions, you could include an additional partition key column, shard, that stores random numbers. For example:

CREATE TABLE IF NOT EXISTS tracker.blogs (
    publish_date date,
    shard int,
    title text,
    description text,
    PRIMARY KEY ((publish_date, shard), title));

When inserting data, you might choose a random number between 1 and 200 for the shard column. This yields compound partition key values like (2020-07-09, 1), (2020-07-09, 2), and so on, through (2020-07-09, 200). Because you are randomizing the partition key, the writes to the table on each day are spread evenly across multiple partitions. This results in better parallelism and higher overall throughput.
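
The following Python sketch illustrates this write path. It assumes a session object that is already connected to Amazon Keyspaces through a Cassandra driver such as the DataStax Python driver; the connection setup, the insert_blog helper, and the SHARD_COUNT constant are illustrative, not part of any API.

import random

SHARD_COUNT = 200

def insert_blog(session, publish_date, title, description):
    # Choose a random shard between 1 and SHARD_COUNT so that writes
    # for the same publish_date spread across many partitions.
    shard = random.randint(1, SHARD_COUNT)
    session.execute(
        """
        INSERT INTO tracker.blogs (publish_date, shard, title, description)
        VALUES (%s, %s, %s, %s)
        """,
        (publish_date, shard, title, description),
    )

Calling insert_blog with a given date writes the row to one of 200 possible partitions for that day.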

However, to read all the rows for a given day, you would have to query the rows for all the shards and then merge the results. For example, you would first issue a SELECT statement for the partition key value (2020-07-09, 1). Then issue another SELECT statement for (2020-07-09, 2), and so on, through (2020-07-09, 200). Finally, your application would have to merge the results from all those SELECT statements.
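
Continuing the sketch above, the read side could fan out one query per shard and concatenate the results (read_blogs_for_day is an illustrative name; the per-shard queries could also be issued concurrently, for example with the driver's execute_async method, to keep latency down):

def read_blogs_for_day(session, publish_date):
    # Query every shard for the given date and merge the results.
    rows = []
    for shard in range(1, SHARD_COUNT + 1):
        result = session.execute(
            """
            SELECT publish_date, shard, title, description
            FROM tracker.blogs
            WHERE publish_date = %s AND shard = %s
            """,
            (publish_date, shard),
        )
        rows.extend(result)
    return rows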

Sharding using compound partition keys and calculated values

A randomizing strategy can greatly improve write throughput. But it's difficult to read a specific row because you don't know which value was written to the shard column when the row was written. To make it easier to read individual rows, you can use a different strategy. Instead of using a random number to distribute the rows among partitions, use a number that you can calculate based upon something that you want to query on.

Consider the previous example, in which a table uses today's date in the partition key. Now suppose that each row has an accessible title column, and that you most often need to find rows by title in addition to date. Before your application writes the row to the table, it could calculate a hash value based on the title and use it to populate the shard column. The calculation might generate a number between 1 and 200 that is fairly evenly distributed, similar to what the random strategy produces.

A simple calculation would likely suffice, such as the product of the UTF-8 code point values for the characters in the title, modulo 200, plus 1. The compound partition key value would then be the combination of the date and the calculated result.
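
As a minimal sketch, that calculation could look like the following in Python (shard_for_title is an illustrative name, not part of any API):

def shard_for_title(title):
    # Product of the code point values of the title's characters,
    # modulo 200, plus 1. Reducing modulo 200 at each step keeps the
    # intermediate product small without changing the result.
    product = 1
    for ch in title:
        product = (product * ord(ch)) % 200
    return product + 1

Because the function is deterministic, a given title maps to the same shard at write time and at read time.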

With this strategy, the writes are spread evenly across the partition key values, and thus across the physical partitions. You can easily perform a SELECT statement for a particular title and date because you can calculate the partition key value for a specific title value.
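
Continuing the sketch above, a point read recomputes the shard from the title and then selects the row by its full primary key (this relies on title being a clustering column of the sharded table):

def read_blog(session, publish_date, title):
    # Recompute the shard to reconstruct the compound partition key,
    # then read the single row by its full primary key.
    shard = shard_for_title(title)
    result = session.execute(
        """
        SELECT publish_date, shard, title, description
        FROM tracker.blogs
        WHERE publish_date = %s AND shard = %s AND title = %s
        """,
        (publish_date, shard, title),
    )
    return result.one()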

To read all the rows for a given day, you still must SELECT each of the (2020-07-09, N) keys (where N is 1–200), and your application then has to merge all the results. The benefit is that you avoid having a single "hot" partition key value taking all of the workload.