A custom data identifier is a set of criteria that you define to detect sensitive data in Amazon Simple Storage Service (Amazon S3) objects. When you create a custom data identifier, you specify a regular expression (regex) that defines a text pattern to match in an S3 object. You can also specify character sequences and a proximity rule that refine the results. The character sequences can be: keywords, which are words or phrases that must be in proximity of text that matches the regex, or ignore words, which are words or phrases to exclude from results. By using custom data identifiers, you can supplement the managed data identifiers that Amazon Macie provides, and detect sensitive data that reflects your organization's particular scenarios, intellectual property, or proprietary data.
For example, many companies have a specific syntax for employee IDs. One such syntax might be: a capital letter that indicates whether an employee is a full-time (F) or part-time (P) employee, followed by a hyphen (-), followed by an eight-digit sequence that identifies the employee. Examples are: F-12345678 for a full-time employee, and P-87654321 for a part-time employee. To detect employee IDs that use this syntax, you might create a custom data identifier that specifies the following regex: [A-Z]-\d{8}. To refine the analysis and avoid false positives, you might also configure the identifier to use keywords (employee and employee ID) and a maximum match distance of 20 characters. With these criteria, results include text that matches the regex only if the text occurs after the keyword employee or employee ID and all the text occurs within 20 characters of one of those keywords.
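The following Python sketch approximates how the employee ID criteria above combine. It is only an illustration built on Python's re module, not Macie's actual matching engine; the function name find_matches and its parameters are hypothetical. It assumes that the distance is measured from the end of a keyword to the end of the matched text, and that the matched text must occur after the keyword.

```python
import re


def find_matches(text, regex, keywords=(), ignore_words=(), max_distance=50):
    """Approximate regex + keyword + distance matching (illustrative only)."""
    # Keywords aren't case sensitive, so search for them with IGNORECASE.
    keyword_ends = [
        m.end()
        for kw in keywords
        for m in re.finditer(re.escape(kw), text, re.IGNORECASE)
    ]
    results = []
    for m in re.finditer(regex, text):
        matched = m.group(0)
        # Ignore words are case sensitive and exclude a match that contains them.
        if any(word in matched for word in ignore_words):
            continue
        if keywords:
            # Assumption: the keyword ends before the match starts, and the end
            # of the match is within max_distance characters of the keyword's end.
            near = any(
                end <= m.start() and m.end() - end <= max_distance
                for end in keyword_ends
            )
            if not near:
                continue
        results.append(matched)
    return results


sample = "Employee ID: F-12345678 joined. Unrelated code Z-00000000"
print(find_matches(sample, r"[A-Z]-\d{8}", ["employee", "employee ID"], (), 20))
```

In this sample, F-12345678 is reported because it follows a keyword closely, while Z-00000000 matches the regex but is too far from either keyword.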
In addition to detection criteria, you can optionally specify custom severity settings for findings that a custom data identifier produces. With custom settings, severity is based on the number of occurrences of text that match the identifier's detection criteria. If you don't specify these settings, Macie automatically assigns the Medium severity to all the findings that the identifier produces; in that case, severity doesn't change based on the number of occurrences.
For detailed information about these and other settings, see Configuration options for custom data identifiers.
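As a rough illustration of the severity behavior, the following Python sketch maps an occurrence count to a severity. It is a simplified model, not Macie's implementation: the function name severity_for and the dictionary format for thresholds are hypothetical, and treating equal thresholds as invalid is an assumption.

```python
SEVERITY_ORDER = ("LOW", "MEDIUM", "HIGH")


def severity_for(occurrences, thresholds=None):
    """Return the severity for a finding, or None if no finding is produced."""
    if thresholds is None:
        # Default settings: Medium severity for one or more matches.
        return "MEDIUM" if occurrences >= 1 else None

    # Keep only the levels that were configured, ordered from Low to High.
    ordered = [(s, thresholds[s]) for s in SEVERITY_ORDER if s in thresholds]
    values = [minimum for _, minimum in ordered]
    # Assumption: thresholds must strictly increase from Low to High.
    if any(a >= b for a, b in zip(values, values[1:])):
        raise ValueError("thresholds must be in ascending order from Low to High")

    result = None
    for severity, minimum in ordered:
        if occurrences >= minimum:
            result = severity  # the highest threshold that is met wins
    return result


custom = {"LOW": 1, "MEDIUM": 50, "HIGH": 100}
print(severity_for(10, custom), severity_for(75, custom), severity_for(150, custom))
```

With custom thresholds, an object that contains fewer occurrences than the lowest threshold yields None, which models the case where no finding is created.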
To create a custom data identifier
You can create a custom data identifier by using the Amazon Macie console or the Amazon Macie API.
Follow these steps to create a custom data identifier by using the Amazon Macie console.
Open the Amazon Macie console at https://console.aws.amazon.com/macie/.
-
In the navigation pane, under Settings, choose Custom data identifiers.
-
Choose Create.
-
For Name, enter a name for the custom data identifier. The name can contain as many as 128 characters.
-
For Description, optionally enter a brief description of the custom data identifier. The description can contain as many as 512 characters.
Note
Avoid including sensitive data in the name or description of a custom data identifier. Other users of your account might be able to access the name or description, depending on the actions that they're allowed to perform in Macie.
-
For Regular expression, enter the regular expression (regex) that defines the text pattern to match. The regex can contain as many as 512 characters.
Macie supports a subset of the pattern syntax provided by the Perl Compatible Regular Expressions (PCRE) library. For additional details and tips, see Detection criteria for custom data identifiers.
-
For Keywords, optionally enter as many as 50 character sequences (separated by commas) to define specific text that must be in proximity of text that matches the regex pattern.
Macie includes an occurrence in results only if the text matches the regex pattern and the text is within the maximum match distance of one of these keywords. Each keyword can contain 3–90 UTF-8 characters. Keywords aren't case sensitive.
-
For Ignore words, optionally enter as many as 10 character sequences (separated by commas) that define specific text to exclude from results.
Macie excludes an occurrence from results if the text matches the regex pattern but it contains one of these ignore words. Each ignore word can contain 4–90 UTF-8 characters. Ignore words are case sensitive.
-
For Maximum match distance, optionally enter the maximum number of characters that can exist between the end of a keyword and the end of text that matches the regex pattern.
Macie includes an occurrence in results only if the text matches the regex pattern and the text is within this distance of a complete keyword. The distance can be 1–300 characters. The default distance is 50 characters.
-
For Severity, choose how to determine the severity of sensitive data findings that the custom data identifier produces:
-
To automatically assign the Medium severity to all findings, choose Use Medium severity for any number of matches (default). With this option, Macie automatically assigns the Medium severity to a finding if the affected S3 object contains one or more occurrences of text that match the detection criteria.
-
To assign severity based on occurrences thresholds that you specify, choose Use custom settings to determine severity. Then use the Occurrences threshold and Severity level options to specify the minimum number of matches that must exist in an S3 object to produce a finding with a selected severity.
You can specify as many as three occurrences thresholds, one for each severity level that Macie supports: Low (least severe), Medium, or High (most severe). If you specify more than one, the thresholds must be in ascending order by severity, moving from Low to High. If an S3 object contains fewer occurrences than the lowest threshold, Macie doesn't create a finding.
-
(Optional) For Tags, choose Add tag, and then enter as many as 50 tags to assign to the custom data identifier.
A tag is a label that you define and assign to certain types of AWS resources. Each tag consists of a required tag key and an optional tag value. Tags can help you identify, categorize, and manage resources in different ways, such as by purpose, owner, environment, or other criteria. To learn more, see Tagging Macie resources.
-
(Optional) For Evaluate, enter up to 1,000 characters in the Sample data box, and then choose Test to test the detection criteria. Macie evaluates the sample data and reports the number of occurrences of text that match the criteria. You can repeat this step as many times as you like to refine and optimize the criteria.
Note
We strongly recommend that you test and refine the detection criteria with sample data. Because custom data identifiers are used by sensitive data discovery jobs, you can't change a custom data identifier after you create it. This helps ensure that you have an immutable history of sensitive data findings and discovery results.
-
When you finish, choose Submit.
Macie tests the settings and verifies that it can compile the regex. If there's an issue with a setting or the regex, Macie displays an error that describes the issue. After you address any issues, you can save the custom data identifier.
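The limits stated in the preceding steps can be checked locally before you submit. The following Python sketch is one way to do that; the function name validate_settings is hypothetical, character counts use Python's len() (code points), and the regex check uses Python's re module, which is not the same as the PCRE subset that Macie supports. Macie remains the authority on whether the settings are valid.

```python
import re


def validate_settings(name, regex, description="", keywords=(),
                      ignore_words=(), max_distance=50):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    if len(name) > 128:
        problems.append("name exceeds 128 characters")
    if len(description) > 512:
        problems.append("description exceeds 512 characters")
    if len(regex) > 512:
        problems.append("regex exceeds 512 characters")
    try:
        # Local sanity check only; Macie compiles the regex itself.
        re.compile(regex)
    except re.error as err:
        problems.append(f"regex does not compile locally: {err}")
    if len(keywords) > 50:
        problems.append("more than 50 keywords")
    for kw in keywords:
        if not 3 <= len(kw) <= 90:
            problems.append(f"keyword {kw!r} is not 3 to 90 characters")
    if len(ignore_words) > 10:
        problems.append("more than 10 ignore words")
    for word in ignore_words:
        if not 4 <= len(word) <= 90:
            problems.append(f"ignore word {word!r} is not 4 to 90 characters")
    if not 1 <= max_distance <= 300:
        problems.append("maximum match distance is not 1 to 300 characters")
    return problems


print(validate_settings("Employee ID", r"[A-Z]-\d{8}",
                        keywords=["employee", "employee ID"], max_distance=20))
```

An empty list means that none of the limits checked here were exceeded.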
After you create the custom data identifier, you can create and configure sensitive data discovery jobs to use it, or add it to your settings for automated sensitive data discovery.
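If you prefer the Amazon Macie API to the console, the same settings can be sent programmatically. The following Python sketch assumes the AWS SDK for Python (Boto3) and its macie2 client's create_custom_data_identifier operation; the helper names build_request and create_identifier are hypothetical, and the example values reuse the employee ID scenario. Running the commented usage requires Boto3, AWS credentials, and permission to use Macie.

```python
def build_request(name, regex, keywords=None, ignore_words=None,
                  max_distance=50, severity_levels=None,
                  description=None, tags=None):
    """Assemble keyword arguments for create_custom_data_identifier."""
    params = {"name": name, "regex": regex, "maximumMatchDistance": max_distance}
    if description:
        params["description"] = description
    if keywords:
        params["keywords"] = list(keywords)
    if ignore_words:
        params["ignoreWords"] = list(ignore_words)
    if severity_levels:
        # severity_levels: iterable of (severity, threshold) pairs,
        # for example [("LOW", 1), ("HIGH", 100)].
        params["severityLevels"] = [
            {"severity": severity, "occurrencesThreshold": threshold}
            for severity, threshold in severity_levels
        ]
    if tags:
        params["tags"] = dict(tags)
    return params


def create_identifier(client, params):
    """Send the request and return the new identifier's ID."""
    response = client.create_custom_data_identifier(**params)
    return response["customDataIdentifierId"]


params = build_request(
    name="Employee ID",
    regex=r"[A-Z]-\d{8}",
    keywords=["employee", "employee ID"],
    max_distance=20,
    severity_levels=[("LOW", 1), ("MEDIUM", 50), ("HIGH", 100)],
)

# Usage sketch (not executed here):
#   import boto3
#   client = boto3.client("macie2")
#   identifier_id = create_identifier(client, params)
```

Because a custom data identifier can't be changed after you create it, it can be worth testing the criteria with sample data before you send a request like this one.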