Detect and process sensitive data
The Detect PII transform identifies personally identifiable information (PII) in your data source. You choose the PII entities to identify, how you want the data to be scanned, and what to do with the PII entities that the transform identifies.
The Detect PII transform can detect, mask, or remove entities that you define or that are pre-defined by AWS. This helps you increase compliance and reduce liability. For example, you may want to ensure that no readable personally identifiable information remains in your data, and so mask social security numbers with a fixed string (such as xxx-xx-xxxx), or mask phone numbers or addresses.
Choosing how you want the data to be scanned
You can choose to detect PII in the entire data source, or detect the fields or columns that contain PII.

When you choose Detect PII in each cell, you’re choosing to scan all rows in the data source. This is a comprehensive scan to ensure that PII entities are identified.
When you choose Detect fields containing PII, you’re choosing to scan a sample of rows for PII entities. This is a way to keep costs and resources low while also identifying the fields where PII entities are found.
When you choose to detect fields that contain PII, you can reduce costs and improve performance by sampling a portion of rows. Choosing this option will allow you to specify additional options:
-
Sample portion: This allows you to specify the percentage of rows to sample. For example, if you enter ‘50’, you’re specifying that 50 percent of the rows are scanned for the PII entity.
-
Detection threshold: This allows you to specify the minimum percentage of scanned rows that must contain the PII entity for the entire column to be identified as having that entity. For example, if you enter ‘10’, the PII entity US Phone must appear in 10 percent or more of the scanned rows for the field to be identified as containing US Phone. If fewer than 10 percent of the scanned rows contain the entity, the field is not labeled as containing US Phone.
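The interaction between Sample portion and Detection threshold can be sketched in plain Python. This is an illustration of the rule described above, not AWS Glue's implementation: the function name column_contains_pii, the seed parameter, and the US_PHONE regular expression are all hypothetical stand-ins.

```python
import random
import re

# Hypothetical stand-in for a US Phone detector; not the pattern AWS Glue uses.
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def column_contains_pii(values, pattern, sample_portion=100,
                        detection_threshold=10, seed=0):
    """Label a column as containing PII when the percentage of sampled
    values that match `pattern` is at least `detection_threshold`."""
    if not values:
        return False
    rng = random.Random(seed)
    # Sample portion: percentage of rows to scan (at least one row).
    sample_size = max(1, round(len(values) * sample_portion / 100))
    sample_size = min(sample_size, len(values))
    sample = rng.sample(values, sample_size)
    # Detection threshold: minimum percentage of scanned rows that must match.
    matches = sum(1 for value in sample if pattern.search(str(value)))
    return 100 * matches / len(sample) >= detection_threshold


# One phone number in ten rows is exactly 10 percent of the scanned rows.
phones = ["206-555-0100"] + ["n/a"] * 9
notes = ["call back later"] * 10

print(column_contains_pii(phones, US_PHONE, 100, 10))
print(column_contains_pii(phones, US_PHONE, 100, 20))
print(column_contains_pii(notes, US_PHONE, 100, 10))
```

With every row scanned, the phones column meets a threshold of 10 but not a threshold of 20. With a smaller sample portion, the result for a sparse column depends on which rows happen to be sampled, which is the trade-off for the lower scanning cost.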

Choosing the PII entities to detect
If you chose Detect PII in each cell, you can choose one of three options:
-
All available PII patterns - this includes AWS entities.
-
Select categories - the patterns to detect automatically include all patterns in the categories that you select.
-
Select specific patterns - Only the patterns that you select will be detected.
Choose from all available PII patterns
If you choose All available PII patterns, select entities pre-defined by AWS. You can select one, more than one, or all entities.

Select categories
If you chose Select categories as the PII patterns to detect, you can select from the options in the drop-down menu. Note that some entities can belong to more than one category. For example, Person's name is an entity that belongs to the Universal and HIPAA categories.
-
Universal (examples: Email, Credit Card)
-
HIPAA (examples: US Driving License, Healthcare Common Procedure Coding System (HCPCS) code)
-
Networking (examples: IP Address, MAC Address)
-
United States (examples: US Phone, US Passport)
-
United Kingdom (examples: UK Bank Account, UK VAT)
-
Japan (examples: Japan My Number, Japan Passport)
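Because an entity can belong to more than one category, selecting several categories yields the union of their patterns without duplicates. The sketch below illustrates this using only the example entities named above; the CATEGORY_EXAMPLES mapping and the patterns_for function are hypothetical, and the real catalog contains many more entities.

```python
# Membership taken only from the examples in this section; not the full catalog.
CATEGORY_EXAMPLES = {
    "Universal": {"Email", "Credit Card", "Person's name"},
    "HIPAA": {"US Driving License", "HCPCS code", "Person's name"},
    "Networking": {"IP Address", "MAC Address"},
    "United States": {"US Phone", "US Passport"},
    "United Kingdom": {"UK Bank Account", "UK VAT"},
    "Japan": {"Japan My Number", "Japan Passport"},
}


def patterns_for(categories):
    """Return the union of example patterns in the selected categories."""
    selected = set()
    for category in categories:
        selected |= CATEGORY_EXAMPLES[category]
    return selected


# "Person's name" is in both categories but appears once in the result.
print(sorted(patterns_for(["Universal", "HIPAA"])))
```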
Select specific patterns
If you choose Select specific patterns as the PII patterns to detect, you can search or browse from a list of patterns you've already created, or create a new detection entity pattern.
The steps below describe how to create a new custom pattern for detecting sensitive data. You create the custom pattern by entering a name, adding a regular expression, and optionally defining context words.
-
To create a new pattern, click the Create new button.
-
On the Create detection entity page, enter the entity name and a regular expression. AWS Glue uses the regular expression (regex) to match entities.
-
Click Validate. If the validation is successful, you will see a confirmation message stating that the string is a valid regular expression. If the validation is not successful, you will see a message stating that the string does not conform to proper formatting and accepted character literals, operators or constructs.
-
You can choose to add Context words in addition to the regular expression. Context words may increase the likelihood of a match. These can be useful in cases where field names are not descriptive of the entity. For example, a field that holds social security numbers may be named 'SSN' or 'SS'. Adding these as context words can help match the entity.
-
Click Create to create the detection entity. Any created entities are visible in the AWS Glue Studio console. To view them, click Detection entities in the left-hand navigation menu.
You can edit, delete, or create detection entities from the Detection entities page. You can also search for a pattern using the search field.
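Before entering a regular expression in the console, it can help to try it locally. The sketch below is a rough local check under stated assumptions: it uses Python's re module, whose syntax may differ from what the console's Validate button accepts, and it treats context words as a simple case-insensitive lookup in the field name. How AWS Glue actually weighs context words is not described here, so validate_regex, detect, SSN_REGEX, and CONTEXT_WORDS are all hypothetical.

```python
import re

# Hypothetical custom pattern for US social security numbers.
SSN_REGEX = r"\b\d{3}-\d{2}-\d{4}\b"
CONTEXT_WORDS = ("ssn", "ss")


def validate_regex(expression):
    """Return (True, None) if the expression compiles under Python's `re`,
    otherwise (False, error message)."""
    try:
        re.compile(expression)
        return True, None
    except re.error as exc:
        return False, str(exc)


def detect(field_name, value, expression, context_words=()):
    """Return (value_matches, context_hit).

    Assumption: a context word counts as a hit when it appears in the
    field name, ignoring case. This is an illustration only.
    """
    value_matches = re.search(expression, value) is not None
    name = field_name.lower()
    context_hit = any(word.lower() in name for word in context_words)
    return value_matches, context_hit


print(validate_regex(SSN_REGEX))
print(validate_regex("(unclosed")[0])
print(detect("SSN", "123-45-6789", SSN_REGEX, CONTEXT_WORDS))
print(detect("id", "123-45-6789", SSN_REGEX, CONTEXT_WORDS))
```

A pattern that passes this local check should still be confirmed with Validate in the console, since that is the check AWS Glue applies.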
Choosing what to do with identified PII data
If you chose to detect PII in the entire data source, you can choose to:
-
Enrich data with detection results: You can store the detected entities in a new column.
-
Redact detected text: You can replace the detected PII value with a string that you specify in the optional Replacing text input field. If no string is specified, the detected PII entity is replaced with '*******'.
-
Apply cryptographic hash: You can pass the detected PII value to a SHA-256 cryptographic hash function and replace the value with the function’s output.
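The redact and hash actions can be illustrated with a minimal Python sketch. It assumes that only the matched span is replaced and that the value is UTF-8 encoded before hashing; neither detail is stated above, and the redact and sha256_hash functions and the SSN pattern are hypothetical stand-ins rather than AWS Glue's implementation.

```python
import hashlib
import re

# Hypothetical stand-in for a detected entity pattern.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text, pattern, replacement="*******"):
    """Replace each detected value with the replacement string.
    The default mirrors the default described above."""
    return pattern.sub(replacement, text)


def sha256_hash(text, pattern):
    """Replace each detected value with the hex digest of its SHA-256 hash.
    Assumption: the value is UTF-8 encoded before hashing."""
    def digest(match):
        return hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
    return pattern.sub(digest, text)


cell = "SSN 123-45-6789"
print(redact(cell, SSN))
print(redact(cell, SSN, "xxx-xx-xxxx"))
print(sha256_hash(cell, SSN))
```

Redaction discards the original value entirely, while hashing maps equal inputs to equal outputs, so hashed columns can still be compared or joined on without exposing the underlying value.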

If you chose to detect fields containing PII, you can choose to take the following actions:
-
Output Detection Results: This creates a new DataFrame with the detected PII information for each column.
-
Redact detected text: You can replace the detected PII value with a string that you specify. If no string is specified, the detected PII entity is replaced with '*******'.
-
Apply cryptographic hash: You can pass the detected PII value to a SHA-256 cryptographic hash function and replace the value with the function’s output.