Building custom data identifiers in Amazon Macie

A custom data identifier is a set of criteria that you define to detect sensitive data. By using custom data identifiers, you can define custom detection rules that reflect your organization's particular scenarios, intellectual property, or proprietary data—for example, employee IDs, customer account numbers, or internal data classifications. These identifiers enable you to perform targeted analysis of your organization's data in a way that supplements the managed data identifiers that Amazon Macie provides.

Components of a custom data identifier

When you create a custom data identifier, you specify a regular expression (regex) that defines a text pattern to match in data. The regex can contain as many as 500 characters.

You can also specify certain character sequences, such as words and phrases, and a proximity rule to refine your analysis of data.

Keywords

These are character sequences that must be in proximity to text that matches the regex pattern.

For unstructured data, such as the contents of a Microsoft Word document, Macie reports text that matches the regex pattern only if the text is within the maximum match distance of one of these keywords.

For structured data, such as the contents of a CSV file, Macie reports text that matches the regex pattern if any of these keywords are in the name of the column or field that stores the text, or the text is within the maximum match distance of one of these words in a field value.

You can specify as many as 50 keywords. Each keyword can contain 3–90 characters. Keywords aren't case sensitive.

Maximum match distance

This is the maximum number of characters that can separate text that matches the regex pattern from a keyword. In unstructured data, the distance applies to the surrounding text; in structured data, it applies within a single field value. If text matches the regex pattern and is within the specified distance of a keyword, Macie reports that occurrence of the text.

You can specify a distance of 1–300 characters. The default distance is 50 characters.

Ignore words

These are specific character sequences to exclude from the results. If text matches the regex pattern and it contains an ignore word, Macie doesn't report that occurrence of the text.

You can specify as many as 10 ignore words. Each ignore word can contain 4–90 characters. Ignore words are case sensitive.
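The quotas described above can be checked locally before you submit an identifier. The following Python sketch is illustrative only: the function name validate_identifier and its messages are hypothetical, and it encodes only the limits stated in this section, not any additional checks that Macie might perform.

```python
from typing import List, Optional, Sequence

# Quotas as stated in this section.
MAX_REGEX_LENGTH = 500
MAX_KEYWORDS = 50
KEYWORD_LENGTH = range(3, 91)      # 3-90 characters
MAX_IGNORE_WORDS = 10
IGNORE_WORD_LENGTH = range(4, 91)  # 4-90 characters
MATCH_DISTANCE = range(1, 301)     # 1-300 characters


def validate_identifier(
    regex: str,
    keywords: Optional[Sequence[str]] = None,
    ignore_words: Optional[Sequence[str]] = None,
    max_match_distance: int = 50,
) -> List[str]:
    """Return a list of quota problems; an empty list means none were found."""
    problems = []
    if not 1 <= len(regex) <= MAX_REGEX_LENGTH:
        problems.append("regex must contain 1-500 characters")

    keywords = list(keywords or [])
    if len(keywords) > MAX_KEYWORDS:
        problems.append("too many keywords (maximum is 50)")
    for word in keywords:
        if len(word) not in KEYWORD_LENGTH:
            problems.append(f"keyword {word!r} must contain 3-90 characters")

    ignore_words = list(ignore_words or [])
    if len(ignore_words) > MAX_IGNORE_WORDS:
        problems.append("too many ignore words (maximum is 10)")
    for word in ignore_words:
        if len(word) not in IGNORE_WORD_LENGTH:
            problems.append(f"ignore word {word!r} must contain 4-90 characters")

    if max_match_distance not in MATCH_DISTANCE:
        problems.append("maximum match distance must be 1-300")
    return problems
```

An empty list means that the criteria fit within the stated quotas. It doesn't guarantee that Macie will accept the regex itself.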

For example, many companies have a specific syntax for employee IDs. One such syntax might be: a capital letter that indicates whether the employee is a full-time (F) or part-time (P) employee, followed by a hyphen (-), followed by an eight-digit sequence that identifies the employee. Examples are: F-12345678, for a full-time employee, and P-87654321, for a part-time employee.

If you create a custom data identifier to detect employee IDs that use this syntax, you might use the following regex: [A-Z]-\d{8}. This pattern accepts any capital letter as the prefix; a stricter pattern that accepts only the two prefixes described above is [FP]-\d{8}. To refine the analysis and avoid false positives, you might also configure the custom data identifier to report only those instances where the keyword employee is within a specific distance of text that matches the regex pattern.
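To get a feel for how the regex, keywords, ignore words, and maximum match distance interact, you can approximate the rules locally. The following Python sketch is not Macie's matching engine. The function find_reported_matches is hypothetical, it uses Python's re module rather than PCRE, and it assumes one simple reading of the proximity rule: a keyword counts if it appears entirely within the given number of characters before or after the match.

```python
import re
from typing import List, Sequence


def find_reported_matches(
    text: str,
    regex: str,
    keywords: Sequence[str] = (),
    ignore_words: Sequence[str] = (),
    max_distance: int = 50,
) -> List[str]:
    """Approximate which regex matches would be reported (illustrative only)."""
    reported = []
    lowered = text.lower()
    for match in re.finditer(regex, text):
        value = match.group(0)
        # Ignore words are case sensitive, so compare without changing case.
        if any(word in value for word in ignore_words):
            continue
        if keywords:
            # Assumed proximity rule: search a window that extends
            # max_distance characters on each side of the match.
            start = max(0, match.start() - max_distance)
            end = min(len(text), match.end() + max_distance)
            window = lowered[start:end]
            # Keywords aren't case sensitive, so compare in lowercase.
            if not any(keyword.lower() in window for keyword in keywords):
                continue
        reported.append(value)
    return reported


sample = "Employee ID: F-12345678 was onboarded last week."
print(find_reported_matches(sample, r"[A-Z]-\d{8}", keywords=["employee"]))
```

With the sample sentence, the ID is reported because the keyword employee appears a few characters before it. The same ID in text that never mentions employee would be skipped. To see how Macie itself evaluates an identifier, use the Sample data test in the console.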

Creating a custom data identifier

The following steps explain how to create a custom data identifier by using the Amazon Macie console.

To create a custom data identifier

  1. Open the Macie console at https://console.aws.amazon.com/macie/.

  2. In the navigation pane, under Settings, choose Custom data identifiers.

  3. Choose Create.

  4. For Name, enter a name for the custom data identifier. The name can contain as many as 128 characters.

    We strongly recommend that you avoid including any sensitive data in this name. Other users of your account might be able to see the identifier's name, depending on the actions that they're allowed to perform in Macie.

  5. For Description, enter a brief description of the custom data identifier. The description can contain as many as 512 characters.

    We strongly recommend that you avoid including any sensitive data in the description. Other users of your account might be able to see the identifier's description, depending on the actions that they're allowed to perform in Macie.

  6. For Regular expression, enter the regular expression (regex) that defines the pattern to match. The regex can contain as many as 500 characters. To learn about supported syntax and constraints, see Regex support in custom data identifiers later in this section.

  7. (Optional) For Keywords, enter as many as 50 keywords (separated by commas) that define specific text to match. Each keyword can contain 3–90 characters. Keywords aren't case sensitive.

    Macie includes a result only if text matches the regex pattern and is within the maximum match distance of one of these keywords.

  8. (Optional) For Ignore words, enter as many as 10 ignore words (separated by commas) that define specific text to exclude from the results. Each ignore word can contain 4–90 characters. Ignore words are case sensitive.

    Macie excludes results for text that contains any of these words, even if the text matches the regex pattern.

  9. (Optional) For Maximum match distance, enter the maximum allowable distance, in characters, between text that matches the regex pattern and any of the keywords. You can specify a distance of 1–300 characters. The default distance is 50 characters.

    Macie includes a result only if the text matches the regex pattern and is within this distance of a keyword.

  10. (Optional) Test the custom data identifier by pasting up to 1,000 characters of text into the Sample data box, and then choosing Submit. Macie evaluates the sample data by using the identifier, and reports the number of matches. You can repeat this step as many times as you like to refine and optimize the identifier.

    Note

    We highly recommend that you test and refine the custom data identifier before you save it. Because custom data identifiers are used by sensitive data discovery jobs, you can't edit a custom data identifier after you save it. This helps ensure that you have an immutable history of sensitive data findings and discovery results for data privacy and protection audits or investigations that you perform.

  11. When you finish, choose Submit.

After you create a custom data identifier, you can use it to analyze data in specific S3 buckets by creating a sensitive data discovery job. When you create a job, you can optionally specify one or more custom data identifiers for the job to use.
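The console steps above can also be scripted. The following sketch uses the AWS SDK for Python (Boto3) and the macie2 client's test_custom_data_identifier and create_custom_data_identifier operations. Several details are assumptions rather than part of this topic: the helper names build_create_request and create_identifier, the default Region, that Boto3 is installed with credentials configured, and that the 1,000-character sample limit from the console also applies to the API call.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


def build_create_request(
    name: str,
    regex: str,
    description: str = "",
    keywords: Optional[Sequence[str]] = None,
    ignore_words: Optional[Sequence[str]] = None,
    max_match_distance: int = 50,
) -> Dict[str, Any]:
    """Assemble the request parameters; optional fields are omitted when empty."""
    request: Dict[str, Any] = {
        "name": name,
        "regex": regex,
        "maximumMatchDistance": max_match_distance,
    }
    if description:
        request["description"] = description
    if keywords:
        request["keywords"] = list(keywords)
    if ignore_words:
        request["ignoreWords"] = list(ignore_words)
    return request


def create_identifier(
    request: Dict[str, Any],
    sample_text: str,
    region_name: str = "us-east-1",  # assumed Region; use your own
) -> Tuple[str, int]:
    """Test the criteria against sample text, then create the identifier."""
    import boto3  # imported here so the request builder works without Boto3

    client = boto3.client("macie2", region_name=region_name)

    # The test operation doesn't take a name or description.
    criteria = {k: v for k, v in request.items() if k not in ("name", "description")}
    test = client.test_custom_data_identifier(
        sampleText=sample_text[:1000], **criteria
    )

    # An identifier can't be edited after it's saved, so test before creating it.
    created = client.create_custom_data_identifier(**request)
    return created["customDataIdentifierId"], test["matchCount"]


request = build_create_request(
    name="employee-id",
    regex=r"[A-Z]-\d{8}",
    description="Detects employee IDs",
    keywords=["employee"],
)
```

Calling create_identifier(request, "Employee ID: F-12345678") would return the new identifier's ID and the number of matches in the sample. That ID is the value you would then supply when you configure a sensitive data discovery job, for example through the customDataIdentifierIds parameter of create_classification_job.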

Regex support in custom data identifiers

Macie supports a subset of the regex pattern syntax provided by the Perl Compatible Regular Expressions (PCRE) library.

Of the constructs provided by the PCRE library, Macie doesn’t support the following pattern elements:

  • Backreferences

  • Capturing groups

  • Conditional patterns

  • Embedded code

  • Global pattern flags, such as /i, /m, and /x

  • Recursive patterns

  • Positive and negative look-behind and look-ahead zero-width assertions, such as ?=, ?!, ?<=, and ?<!
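A rough local check can catch some of the unsupported elements listed above before you paste a pattern into the console. The following Python sketch is a heuristic, not a PCRE parser: the function find_unsupported is hypothetical, it covers only capturing groups, numbered backreferences, conditional patterns, and look-around assertions, and it can misjudge unusual character classes.

```python
from typing import List

LOOKAROUNDS = ("(?=", "(?!", "(?<=", "(?<!")
NAMED_GROUPS = ("(?P<", "(?<", "(?'")


def find_unsupported(pattern: str) -> List[str]:
    """Return descriptions of constructs that this section lists as unsupported."""
    issues = []
    i = 0
    in_class = False  # True while scanning inside [...]
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1 : i + 2]
            # \1 through \9 outside a character class is a backreference.
            if not in_class and nxt.isdigit() and nxt != "0":
                issues.append(f"backreference at index {i}")
            i += 2  # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            rest = pattern[i:]
            if rest.startswith(LOOKAROUNDS):
                issues.append(f"look-around assertion at index {i}")
            elif rest.startswith("(?("):
                issues.append(f"conditional pattern at index {i}")
            elif rest.startswith(NAMED_GROUPS) or not rest.startswith("(?"):
                issues.append(f"capturing group at index {i}")
        i += 1
    return issues


print(find_unsupported(r"(F|P)-\d{8}"))    # flags the capturing group
print(find_unsupported(r"(?:F|P)-\d{8}"))  # non-capturing group: no issues
```

A non-capturing group such as (?:F|P) keeps the grouping without capturing, which is one way to rewrite a pattern that this check flags.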

To protect against malformed or long-running expressions, Macie automatically tests custom data identifiers against a collection of sample text.

The following tips and recommendations can help you create effective regex patterns for custom data identifiers in Macie:

  • Anchors – Use anchors (^ or $) only if you expect the pattern to appear at the beginning or end of the S3 object, not the beginning or end of a line.

  • Bounded repeats – For performance reasons, Macie limits the size of bounded repeat groups. For example, \d{100,1000} won’t compile in Macie. To approximate this functionality, you can use an open-ended repeat such as \d{100,}.

  • Case insensitivity – To make parts of a pattern case insensitive, you can use the (?i) construct instead of the /i flag.

  • Performance – There’s no need to optimize prefixes or alternations manually. For example, changing /hello|hi|hey/ to /h(?:ello|i|ey)/ won’t improve performance.

  • Wildcards – For performance reasons, Macie limits the number of repeated wildcards. For example, a*b*a* won’t compile in Macie.
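Some of these tips can be tried out locally. The following sketch uses Python's re module as a stand-in, which is an assumption: Python's engine is not PCRE and doesn't enforce Macie's limits, so it shows only what the suggested constructs mean, not whether Macie will compile them. The sample strings are made up for illustration.

```python
import re

# Case insensitivity: an inline (?i) at the start of the pattern
# takes the place of a trailing /i flag.
ids = re.findall(r"(?i)emp-\d{4}", "EMP-1234 and emp-5678")

# Anchors: without a multiline flag, ^ matches only at the very start of
# the input, which is comparable to the start of an S3 object, not of each line.
starts = re.findall(r"^ID-\d{3}", "ID-001\nID-002\n")

# Bounded repeats: an open-ended repeat stands in for a large bounded one.
# Unlike \d{100,1000}, it also matches runs longer than 1,000 digits.
long_run = re.search(r"\d{100,}", "7" * 150)

print(ids)
print(starts)
print(len(long_run.group(0)))
```

In this sketch, the second ID-002 line isn't matched because ^ doesn't apply at line starts. If a pattern should match anywhere in an object, leave the anchors out.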