Detecting and correcting index key violations in DynamoDB
During the backfill phase of global secondary index creation, Amazon DynamoDB examines each item in the table to determine whether it is eligible for inclusion in the index. Some items might not be eligible because they would cause index key violations. In these cases, the items remain in the table, but the index doesn't have a corresponding entry for that item.
An index key violation occurs in the following situations:
-
There is a data type mismatch between an attribute value and the index key schema data type. For example, suppose that one of the items in the GameScores table had a TopScore value of type String. If you added a global secondary index with a partition key of TopScore, of type Number, the item from the table would violate the index key.
-
An attribute value from the table exceeds the maximum length for an index key attribute. The maximum length of a partition key is 2048 bytes, and the maximum length of a sort key is 1024 bytes. If any of the corresponding attribute values in the table exceed these limits, the item from the table would violate the index key.
Note
If a String or Binary attribute value is set for an attribute that is used as an index key, then the attribute value must have a length greater than zero; otherwise, the item from the table would violate the index key.
The Violation Detector tool does not currently flag this kind of index key violation.
If an index key violation occurs, the backfill phase continues without interruption. However, any violating items are not included in the index. After the backfill phase completes, all writes to items that violate the new index's key schema will be rejected.
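The rules above can be expressed as a small check. The following Python sketch is illustrative only and is not how DynamoDB or Violation Detector is implemented. The function name `find_key_violation`, the single-entry dict representation such as `{"S": "Hello"}`, and the "Empty Value" label are assumptions made for this example.

```python
from typing import Optional

# Maximum index key lengths stated above, in bytes.
MAX_KEY_BYTES = {"partition": 2048, "sort": 1024}


def find_key_violation(attr: dict, expected_type: str,
                       role: str = "partition") -> Optional[str]:
    """Return a description of the violation, or None if the value is a valid key.

    `attr` is a single-entry dict such as {"S": "Hello"} or {"N": "5"}
    (an assumed representation; Binary values are passed as bytes).
    """
    (found_type, value), = attr.items()

    # Rule 1: the attribute type must match the index key schema type.
    if found_type != expected_type:
        return f"Type Violation (Expected: {expected_type} Found: {found_type})"

    # Rule 2 and the note: String and Binary keys must be non-empty
    # and must fit within the maximum key length.
    if found_type in ("S", "B"):
        size = len(value.encode("utf-8")) if isinstance(value, str) else len(value)
        if size == 0:
            return "Empty Value"  # label invented for this sketch
        if size > MAX_KEY_BYTES[role]:
            return "Size Violation"
    return None
```

For example, `find_key_violation({"S": "Hello"}, "N")` reports a type violation, while `find_key_violation({"N": "5"}, "N")` returns `None`.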
To identify and fix attribute values in a table that violate an index key, use the Violation Detector tool. To run Violation Detector, you create a configuration file that specifies the name of a table to be scanned, the names and data types of the global secondary index partition key and sort key, and what actions to take if any index key violations are found. Violation Detector can run in one of two different modes:
-
Detection mode — Detect index key violations. Use detection mode to report the items in the table that would cause key violations in a global secondary index. (You can optionally request that these violating table items be deleted immediately when they are found.) The output from detection mode is written to a file, which you can use for further analysis.
-
Correction mode — Correct index key violations. In correction mode, Violation Detector reads an input file with the same format as the output file from detection mode. Correction mode reads the records from the input file and, for each record, it either deletes or updates the corresponding items in the table. (Note that if you choose to update the items, you must edit the input file and set appropriate values for these updates.)
Downloading and running Violation Detector
Violation Detector is available as an executable Java Archive (.jar file), and runs on Windows, macOS, or Linux computers. Violation Detector requires Java 1.7 (or later) and Apache Maven.
Follow the instructions in the README.md file to download and install Violation Detector using Maven.
To start Violation Detector, go to the directory where you have built ViolationDetector.java and enter the following command.
java -jar ViolationDetector.jar [options]
The Violation Detector command line accepts the following options:
-
-h | --help: Prints a usage summary and options for Violation Detector.
-
-p | --configFilePath value: The fully qualified name of a Violation Detector configuration file. For more information, see The Violation Detector configuration file.
-
-t | --detect value: Detect index key violations in the table, and write them to the Violation Detector output file. If the value of this parameter is set to keep, items with key violations are not modified. If the value is set to delete, items with key violations are deleted from the table.
-
-c | --correct value: Read index key violations from an input file, and take corrective actions on the items in the table. If the value of this parameter is set to update, items with key violations are updated with new, non-violating values. If the value is set to delete, items with key violations are deleted from the table.
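When the tool is driven from a script, it can help to validate the mode and its value before launching Java. The following Python sketch only assembles the argument list from the options described above. The helper name `build_command` and the default jar file name are assumptions for illustration.

```python
from typing import List

# Allowed values for each mode option, as described in the list above.
ALLOWED_ACTIONS = {
    "--detect": {"keep", "delete"},
    "--correct": {"update", "delete"},
}


def build_command(config_path: str, mode: str, action: str,
                  jar: str = "ViolationDetector.jar") -> List[str]:
    """Build the argument list for one Violation Detector run."""
    if mode not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown mode option: {mode}")
    if action not in ALLOWED_ACTIONS[mode]:
        raise ValueError(f"{mode} does not accept the value {action!r}")
    return ["java", "-jar", jar, "--configFilePath", config_path, mode, action]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Because the delete variants remove items from the table, a wrapper like this is a reasonable place to add an explicit confirmation step.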
The Violation Detector configuration file
At runtime, the Violation Detector tool requires a configuration file. The parameters in this file determine which DynamoDB resources Violation Detector can access, and how much provisioned throughput it can consume. The following table describes these parameters.
Parameter name | Description | Required? |
---|---|---|
awsCredentialsFile | The fully qualified name of a file containing your AWS credentials. The credentials file must be in the following format: | Yes |
dynamoDBRegion | The AWS Region in which the table resides. For example: us-west-2 | Yes |
tableName | The name of the DynamoDB table to be scanned. | Yes |
gsiHashKeyName | The name of the index partition key. | Yes |
gsiHashKeyType | The data type of the index partition key: S (String), N (Number), or B (Binary). | Yes |
 | The name of the index sort key. Do not specify this parameter if the index only has a simple primary key (partition key). | No |
 | The data type of the index sort key: S (String), N (Number), or B (Binary). Do not specify this parameter if the index only has a simple primary key (partition key). | No |
recordDetails | Whether to write the full details of index key violations to the output file. If set to true, Violation Detector writes details of each violation to the output file. | No |
recordGsiValueInViolationRecord | Whether to write the values of the violating index keys to the output file. If set to | No |
detectionOutputPath | The full path of the Violation Detector output file. This parameter supports writing to a local directory or to Amazon Simple Storage Service (Amazon S3). The following are examples: Information in the output file appears in comma-separated values (CSV) format. If you don't set | No |
numOfSegments | The number of parallel scan segments to be used when Violation Detector scans the table. The default value is 1, meaning that the table is scanned in a sequential manner. If the value is 2 or higher, then Violation Detector divides the table into that many logical segments and an equal number of scan threads. The maximum setting for For larger tables, a parallel scan is generally faster than a sequential scan. In addition, if the table is large enough to span multiple partitions, a parallel scan distributes its read activity evenly across multiple partitions. For more information about parallel scans in DynamoDB, see Parallel scan. | No |
 | The upper limit of index key violations to write to the output file. If set to | No |
 | The number of items in the table to be scanned. If set to -1 (the default), the entire table is scanned. If set to a positive integer, Violation Detector stops after it scans that many items in the table. | No |
readWriteIOPSPercent | Regulates the percentage of provisioned read capacity units that are consumed during the table scan. Valid values range from | No |
correctionInputPath | The full path of the Violation Detector correction input file. If you run Violation Detector in correction mode, the contents of this file are used to modify or delete data items in the table that violate the global secondary index. The format of the correctionInputPath file is the same as that of the detectionOutputPath file, so you can use the output from detection as input for correction. | No |
 | The full path of the Violation Detector correction output file. This file is created only if there are update errors. This parameter supports writing to a local directory or to Amazon S3. The following are examples: Information in the output file appears in CSV format. If you don't set | No |
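To make the segment setting more concrete, the following Python sketch builds the request parameters that a parallel scan of this kind would use. `Segment` and `TotalSegments` are parameters of the DynamoDB Scan operation. How Violation Detector issues its scans internally is not described here, so the helper `scan_requests` is a hypothetical illustration rather than the tool's own logic.

```python
from typing import Dict, List


def scan_requests(table_name: str, num_segments: int = 1) -> List[Dict[str, object]]:
    """Return one set of Scan parameters per logical segment."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    if num_segments == 1:
        # Sequential scan: no segment parameters are needed.
        return [{"TableName": table_name}]
    # Parallel scan: each worker scans one segment out of TotalSegments.
    return [
        {"TableName": table_name, "Segment": segment, "TotalSegments": num_segments}
        for segment in range(num_segments)
    ]
```

Each dictionary could be handed to its own worker thread, for example as keyword arguments to a boto3 client's `scan` call, so that the segments are read concurrently.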
Detection
To detect index key violations, use Violation Detector with the --detect command line option. To show how this option works, consider the ProductCatalog table. The following is a list of items in the table. Only the primary key (Id) and the Price attribute are shown.
Id (primary key) | Price |
---|---|
101 | 5 |
102 | 20 |
103 | 200 |
201 | 100 |
202 | 200 |
203 | 300 |
204 | 400 |
205 | 500 |
All of the values for Price are of type Number. However, because DynamoDB is schemaless, it is possible to add an item with a non-numeric Price. For example, suppose that you add another item to the ProductCatalog table.
Id (primary key) | Price |
---|---|
999 | "Hello" |
The table now has a total of nine items.
Now you add a new global secondary index to the table: PriceIndex. The primary key for this index is a partition key, Price, which is of type Number.
After the index has been built, it will contain eight items, but the ProductCatalog table has nine items. The reason for this discrepancy is that the value "Hello" is of type String, but PriceIndex has a primary key of type Number. The String value violates the global secondary index key, so it is not present in the index.
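The eight-versus-nine discrepancy can be reproduced with a small simulation. The following Python sketch models the nine ProductCatalog items and applies only the type check described here. The `backfill` helper is hypothetical, and representing Id as a Number is an assumption made for this example.

```python
# The eight original items plus the item with the String price.
prices = {101: 5, 102: 20, 103: 200, 201: 100, 202: 200, 203: 300, 204: 400, 205: 500}
items = [{"Id": {"N": str(i)}, "Price": {"N": str(p)}} for i, p in prices.items()]
items.append({"Id": {"N": "999"}, "Price": {"S": "Hello"}})


def backfill(table_items, key_name, key_type):
    """Split items into those eligible for the index and those that violate its key type."""
    indexed, violating = [], []
    for item in table_items:
        attr = item.get(key_name)
        if attr is None:
            continue  # no key attribute: the item is simply not indexed
        if key_type in attr:
            indexed.append(item)
        else:
            violating.append(item)
    return indexed, violating


indexed, violating = backfill(items, "Price", "N")
print(len(items), len(indexed), len(violating))  # 9 8 1
```

The single violating item is the one with Id 999, which stays in the table but has no entry in PriceIndex.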
To use Violation Detector in this scenario, you first create a configuration file such as the following.
# Properties file for violation detection tool configuration.
# Parameters that are not specified will use default values.

awsCredentialsFile = /home/alice/credentials.txt
dynamoDBRegion = us-west-2
tableName = ProductCatalog
gsiHashKeyName = Price
gsiHashKeyType = N
recordDetails = true
recordGsiValueInViolationRecord = true
detectionOutputPath = ./gsi_violation_check.csv
correctionInputPath = ./gsi_violation_check.csv
numOfSegments = 1
readWriteIOPSPercent = 40
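A file like this can also be generated from a script, which makes it easier to catch a missing required parameter before the tool runs. The following Python sketch is one possible approach. The helper name `render_config` is an assumption, and the five names treated as required are those from the sample above that correspond to the rows marked "Yes" in the parameter table.

```python
# Parameters treated as required in this sketch.
REQUIRED = (
    "awsCredentialsFile",
    "dynamoDBRegion",
    "tableName",
    "gsiHashKeyName",
    "gsiHashKeyType",
)


def render_config(params: dict) -> str:
    """Render parameters as 'name = value' lines, one per parameter."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = ["# Properties file for violation detection tool configuration."]
    for name, value in params.items():
        if isinstance(value, bool):
            value = str(value).lower()  # write true/false as in the sample file
        lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to config.txt produces a file in the same shape as the sample configuration.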
Next, you run Violation Detector as in the following example.
$ java -jar ViolationDetector.jar --configFilePath config.txt --detect keep

Violation detection started: sequential scan, Table name: ProductCatalog, GSI name: PriceIndex
Progress: Items scanned in total: 9, Items scanned by this thread: 9, Violations found by this thread: 1, Violations deleted by this thread: 0
Violation detection finished: Records scanned: 9, Violations found: 1, Violations deleted: 0, see results at: ./gsi_violation_check.csv
If the recordDetails config parameter is set to true, Violation Detector writes details of each violation to the output file, as in the following example.
Table Hash Key,GSI Hash Key Value,GSI Hash Key Violation Type,GSI Hash Key Violation Description,GSI Hash Key Update Value(FOR USER),Delete Blank Attributes When Updating?(Y/N)
999,"{""S"":""Hello""}",Type Violation,Expected: N Found: S,,
The output file is in CSV format. The first line in the file is a header, followed by one record per item that violates the index key. The fields of these violation records are as follows:
-
Table hash key: The partition key value of the item in the table.
-
Table range key: The sort key value of the item in the table.
-
GSI hash key value: The partition key value of the global secondary index.
-
GSI hash key violation type: Either Type Violation or Size Violation.
-
GSI hash key violation description: The cause of the violation.
-
GSI hash key update Value(FOR USER): In correction mode, a new user-supplied value for the attribute.
-
GSI range key value: The sort key value of the global secondary index.
-
GSI range key violation type: Either Type Violation or Size Violation.
-
GSI range key violation description: The cause of the violation.
-
GSI range key update Value(FOR USER): In correction mode, a new user-supplied value for the attribute.
-
Delete blank attribute when Updating(Y/N): In correction mode, determines whether to delete (Y) or keep (N) the violating item in the table, but only if either of the following fields is blank:
-
GSI Hash Key Update Value(FOR USER)
-
GSI Range Key Update Value(FOR USER)
If either of these fields is non-blank, then Delete Blank Attribute When Updating(Y/N) has no effect.
Note
The output format might vary, depending on the configuration file and command line options. For example, if the table has a simple primary key (without a sort key), no sort key fields will be present in the output.
The violation records in the file might not be in sorted order.
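Because the output is ordinary CSV, it can be loaded with standard tooling for further analysis. The following Python sketch parses the sample record shown earlier using the standard `csv` and `json` modules. Treating the GSI Hash Key Value column as JSON is an assumption based on that single sample record, and `read_violations` is a hypothetical helper name.

```python
import csv
import io
import json

# The sample detection output shown earlier (header plus one record).
SAMPLE = (
    "Table Hash Key,GSI Hash Key Value,GSI Hash Key Violation Type,"
    "GSI Hash Key Violation Description,GSI Hash Key Update Value(FOR USER),"
    "Delete Blank Attributes When Updating?(Y/N)\n"
    '999,"{""S"":""Hello""}",Type Violation,Expected: N Found: S,,\n'
)


def read_violations(text: str) -> list:
    """Parse detection output into a list of dicts keyed by column header."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row.get("GSI Hash Key Value")
        # Decode the typed value when present; the column may be absent or blank.
        row["parsed_value"] = json.loads(raw) if raw else None
        rows.append(row)
    return rows


for violation in read_violations(SAMPLE):
    print(violation["Table Hash Key"], violation["GSI Hash Key Violation Type"],
          violation["parsed_value"])
```

Looking columns up by header name rather than by position keeps the code working when sort key columns are present or absent.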
Correction
To correct index key violations, use Violation Detector with the --correct command line option. In correction mode, Violation Detector reads the input file specified by the correctionInputPath parameter. This file has the same format as the detectionOutputPath file, so that you can use the output from detection as input for correction.
Violation Detector provides two different ways to correct index key violations:
-
Delete violations — Delete the table items that have violating attribute values.
-
Update violations — Update the table items, replacing the violating attributes with non-violating values.
In either case, you can use the output file from detection mode as input for correction mode.
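For the update path, the detection output must be edited so that each record carries a replacement value. The following Python sketch fills the GSI Hash Key Update Value(FOR USER) column for selected items. The helper name `fill_update_values` is hypothetical, and the exact value format that Violation Detector expects in this column is not specified here, so the plain string "100" is only a placeholder assumption.

```python
import csv
import io

KEY_COLUMN = "Table Hash Key"
UPDATE_COLUMN = "GSI Hash Key Update Value(FOR USER)"

# Sample detection output (header plus one record), as shown earlier.
SAMPLE = (
    "Table Hash Key,GSI Hash Key Value,GSI Hash Key Violation Type,"
    "GSI Hash Key Violation Description,GSI Hash Key Update Value(FOR USER),"
    "Delete Blank Attributes When Updating?(Y/N)\n"
    '999,"{""S"":""Hello""}",Type Violation,Expected: N Found: S,,\n'
)


def fill_update_values(text: str, new_values: dict) -> str:
    """Return the CSV text with update values set for the given table keys."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        key = row[KEY_COLUMN]
        if key in new_values:
            row[UPDATE_COLUMN] = new_values[key]
        writer.writerow(row)
    return out.getvalue()


# Placeholder value; confirm the expected format before running a correction.
corrected = fill_update_values(SAMPLE, {"999": "100"})
```

The edited text would then be saved to the file named by correctionInputPath before running the tool with --correct update.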
Continuing with the ProductCatalog example, suppose that you want to delete the violating item from the table. To do this, you use the following command line.
$ java -jar ViolationDetector.jar --configFilePath config.txt --correct delete
At this point, you are asked to confirm whether you want to delete the violating items.
Are you sure to delete all violations on the table?y/n
y
Confirmed, will delete violations on the table...
Violation correction from file started: Reading records from file: ./gsi_violation_check.csv, will delete these records from table.
Violation correction from file finished: Violations delete: 1, Violations Update: 0
Now both ProductCatalog and PriceIndex have the same number of items.