Detect and process sensitive data - AWS Glue Studio

Detect and process sensitive data

The Detect PII transform identifies personally identifiable information (PII) in your data source. You choose which PII entities to identify, how the data is scanned, and what to do with the PII entities that the transform identifies.

The Detect PII transform lets you detect, mask, or remove entities that you define or that are pre-defined by AWS. This can help you meet compliance requirements and reduce liability. For example, you may want to ensure that no readable personally identifiable information remains in your data by masking social security numbers, phone numbers, or addresses with a fixed string (such as xxx-xx-xxxx).

Choosing how you want the data to be scanned

You can choose to detect PII in the entire data source, or to detect the fields (columns) that contain PII.


                
(Screenshot: the Detect PII transform options for how the data source is scanned, either all rows and columns, or a sample of rows to find the columns that contain PII.)

When you choose Detect PII in each cell, you’re choosing to scan all rows in the data source. This is a comprehensive scan to ensure that PII entities are identified.

When you choose Detect fields containing PII, you scan a sample of rows for PII entities. Sampling reduces cost and improves performance while still identifying the fields where PII entities are found.

This option lets you specify additional settings:

  • Sample portion: The percentage of rows to sample. For example, if you enter ‘50’, 50 percent of the rows are scanned for the PII entity.

  • Detection threshold: The minimum percentage of scanned rows that must contain the PII entity for the entire column to be identified as having that entity. For example, if you enter ‘10’, the PII entity US Phone must appear in 10 percent or more of the scanned rows for the field to be identified as having US Phone. If fewer than 10 percent of the scanned rows contain the entity, the field is not labeled as having US Phone.
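
The interaction between the two settings above can be sketched in plain Python. This is a minimal illustration and not the code that AWS Glue runs: the function name column_has_entity, the seed parameter, and the simplified US_PHONE expression are all assumptions made for the example.

```python
import random
import re

# Simplified stand-in for a US phone pattern (an assumption, not the AWS pattern).
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")


def column_has_entity(values, pattern, sample_portion=100, detection_threshold=10, seed=0):
    """Sample a percentage of rows and apply the detection threshold to the sample."""
    values = list(values)
    if not values:
        return False
    # Sample portion: percentage of rows to scan (at least one row).
    sample_size = min(len(values), max(1, round(len(values) * sample_portion / 100)))
    sample = random.Random(seed).sample(values, sample_size)
    # Detection threshold: minimum percentage of scanned rows that must match.
    matches = sum(1 for value in sample if pattern.search(str(value)))
    return matches * 100 >= detection_threshold * sample_size


# One of ten rows holds a phone number, so 10 percent of scanned rows match.
phone_column = ["555-123-4567"] + ["n/a"] * 9
print(column_has_entity(phone_column, US_PHONE, sample_portion=100, detection_threshold=10))
```

With a sample portion below 100, the result depends on which rows are drawn, so a column with few PII values can go unlabeled. That is the trade-off between cost and coverage that the two settings control.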


                
(Screenshot: the Detect PII transform options shown when you choose to detect fields that contain PII.)

Choosing the PII entities to detect

If you chose Detect PII in each cell, you can choose one of three options:

  • All available PII patterns - This includes the entities pre-defined by AWS.

  • Select categories - All patterns in the categories that you select are detected.

  • Select specific patterns - Only the patterns that you select are detected.

Choose from all available PII patterns

If you choose All available PII patterns, you select from the entities pre-defined by AWS. You can select one, several, or all entities.


                
(Screenshot: the list of pre-defined AWS entities.)

Select categories

If you chose Select categories as the PII patterns to detect, you can select from the options in the drop-down menu. Note that some entities can belong to more than one category. For example, Person's name is an entity that belongs to the Universal and HIPAA categories.

  • Universal (examples: Email, Credit Card)

  • HIPAA (examples: US Driving License, Healthcare Common Procedure Coding System (HCPCS) code)

  • Networking (examples: IP Address, MAC Address)

  • United States (examples: US Phone, US Passport)

  • United Kingdom (examples: UK Bank Account, UK VAT)

  • Japan (examples: Japan My Number, Japan Passport)
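
Because an entity can belong to more than one category, selecting several categories yields the union of their patterns, with shared entities counted once. The sketch below illustrates this using only the example entities named above. The dictionary is deliberately incomplete, and the names CATEGORIES and patterns_for are hypothetical.

```python
# Example entities taken from the list above; this is not the full AWS catalog.
CATEGORIES = {
    "Universal": {"Email", "Credit Card", "Person's name"},
    "HIPAA": {"US Driving License", "HCPCS code", "Person's name"},
    "Networking": {"IP Address", "MAC Address"},
    "United States": {"US Phone", "US Passport"},
    "United Kingdom": {"UK Bank Account", "UK VAT"},
    "Japan": {"Japan My Number", "Japan Passport"},
}


def patterns_for(selected_categories):
    """Return the union of the patterns in the selected categories."""
    patterns = set()
    for category in selected_categories:
        patterns |= CATEGORIES[category]
    return patterns


# "Person's name" is in both categories but appears only once in the result.
print(sorted(patterns_for(["Universal", "HIPAA"])))
```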

Select specific patterns

If you choose Select specific patterns as the PII patterns to detect, you can search or browse a list of patterns you've already created, or create a new detection entity pattern.

The steps below describe how to create a new custom pattern for detecting sensitive data. You create the custom pattern by entering a name, adding a regular expression, and optionally defining context words.

  1. To create a new pattern, click the Create new button.

    
                            
    (Screenshot: the Select patterns section.)

  2. On the Create detection entity page, enter the entity name and a regular expression. AWS Glue uses the regular expression (regex) to match entities.

  3. Click Validate. If the validation is successful, you will see a confirmation message stating that the string is a valid regular expression. If the validation is not successful, you will see a message stating that the string does not conform to proper formatting and accepted character literals, operators or constructs.

  4. Optionally, add Context words in addition to the regular expression. Context words may increase the likelihood of a match. They can be useful when field names are not descriptive of the entity. For example, fields that hold social security numbers may be named 'SSN' or 'SS'. Adding these as context words can help match the entity.

  5. Click Create to create the detection entity. Created entities are visible in the AWS Glue Studio console. To view them, choose Detection entities in the left-hand navigation menu.

    You can edit, delete, or create detection entities from the Detection entities page. You can also search for a pattern using the search field.
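
Before entering a custom pattern in the console, it can help to check it locally. The sketch below mirrors the Validate step and shows one way context words could raise confidence in a match. Both parts rest on assumptions: Python's re syntax may differ from the regex dialect that AWS Glue accepts, and the scoring in score_match is hypothetical, since the way AWS Glue weighs context words is not described here.

```python
import re


def validate_regex(expression):
    """Return (is_valid, message) for a candidate regular expression."""
    try:
        re.compile(expression)
    except re.error as exc:
        return False, f"invalid regular expression: {exc}"
    return True, "valid regular expression"


def score_match(field_name, value, expression, context_words=()):
    """Hypothetical scoring: 0 = no match, 1 = regex only, 2 = regex plus context word."""
    if not re.search(expression, str(value)):
        return 0
    name = field_name.lower()
    has_context = any(word.lower() in name for word in context_words)
    return 2 if has_context else 1


# Illustrative custom pattern for social security numbers.
SSN_REGEX = r"\b\d{3}-\d{2}-\d{4}\b"

print(validate_regex(SSN_REGEX))
print(validate_regex(r"(\d{3}"))  # unbalanced parenthesis fails validation
print(score_match("ssn", "123-45-6789", SSN_REGEX, ["SSN", "SS"]))
print(score_match("col7", "123-45-6789", SSN_REGEX, ["SSN", "SS"]))
```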

Choosing what to do with identified PII data

If you chose to detect PII in the entire data source, you can choose to:

  • Enrich data with detection results: You can store the detected entities in a new column.

  • Redact detected text: You can replace the detected PII value with a string that you specify in the optional Replacing text input field. If no string is specified, the detected PII entity is replaced with '*******'.

  • Apply cryptographic hash: You can pass the detected PII value to a SHA-256 cryptographic hash function and replace the value with the function’s output.
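
The enrich option can be pictured as adding one column that records what was found in each row. The sketch below works on plain dictionaries, not on an AWS Glue DynamicFrame. The column name DetectedEntities, the entity labels, and the two simplified patterns are assumptions for illustration only.

```python
import re

# Simplified illustrative patterns; not the patterns AWS Glue uses.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "USA_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def enrich(rows, output_column="DetectedEntities"):
    """Scan every cell and store the detected entities per column in a new column."""
    enriched = []
    for row in rows:
        found = {}
        for column, value in row.items():
            labels = [name for name, pattern in PATTERNS.items() if pattern.search(str(value))]
            if labels:
                found[column] = labels
        # The original values are kept; only the result column is added.
        enriched.append({**row, output_column: found})
    return enriched


rows = [
    {"name": "Ana", "contact": "ana@example.com"},
    {"name": "Bo", "contact": "none"},
]
for row in enrich(rows):
    print(row)
```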


                
(Screenshot: the Detect PII transform options shown when you choose to scan all rows in the data source.)

If you chose to detect fields containing PII, you can choose to take the following actions:

  • Output Detection Results: This creates a new DataFrame with the detected PII information for each column.

  • Redact detected text: You can replace the detected PII value with a string that you specify. If no string is specified, the detected PII entity is replaced with '*******'.

  • Apply cryptographic hash: You can pass the detected PII value to a SHA-256 cryptographic hash function and replace the value with the function’s output.
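
The redact and hash actions can be sketched with the Python standard library. This example assumes that only the matched substring is replaced, not the whole cell, and it uses a simplified social security number pattern. The function names redact and sha256_hash are hypothetical.

```python
import hashlib
import re

# Simplified illustrative pattern for a US social security number.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text, pattern, replacement="*******"):
    """Replace each detected value with the replacement string."""
    # A function is used so the replacement is inserted literally.
    return pattern.sub(lambda _match: replacement, text)


def sha256_hash(text, pattern):
    """Replace each detected value with the hex digest of its SHA-256 hash."""
    return pattern.sub(
        lambda match: hashlib.sha256(match.group().encode("utf-8")).hexdigest(),
        text,
    )


cell = "SSN: 123-45-6789"
print(redact(cell, SSN))
print(redact(cell, SSN, replacement="xxx-xx-xxxx"))
print(sha256_hash(cell, SSN))
```

Hashing is deterministic, so the same input value always produces the same output. This keeps hashed columns usable for joins and grouping. Redaction with a fixed string discards that relationship.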