Building custom data identifiers in Amazon Macie

A custom data identifier is a set of criteria that you define to detect sensitive data in Amazon Simple Storage Service (Amazon S3) objects. The criteria consist of a regular expression (regex) that defines a text pattern to match and, optionally, character sequences and a proximity rule that refine the results.

With custom data identifiers, you can define detection criteria that reflect your organization's particular scenarios, intellectual property, or proprietary data—for example, employee IDs, customer account numbers, or internal data classifications. If you configure sensitive data discovery jobs or automated sensitive data discovery to use these identifiers, you can analyze S3 objects in a way that supplements the managed data identifiers that Amazon Macie provides.

In addition to detection criteria, you can define custom severity settings for sensitive data findings that a custom data identifier produces. By default, Macie assigns the Medium severity to all findings that a custom data identifier produces—severity doesn't change based on the number of occurrences of text that match a custom data identifier's detection criteria. By defining custom severity settings, you can specify which severity to assign based on the number of occurrences of text that match the criteria.

Defining detection criteria for custom data identifiers

When you create a custom data identifier, you specify a regular expression (regex) that defines a text pattern to match in S3 objects. Macie supports a subset of the regex pattern syntax provided by the Perl Compatible Regular Expressions (PCRE) library. For more information, see Regex support later in this section.

You can also specify character sequences, such as words and phrases, and a proximity rule to refine the results.

Keywords

These are specific character sequences that must be in proximity of text that matches the regex pattern. The proximity requirements vary based on an S3 object's storage format or file type:

  • For structured, columnar data, Macie includes a result if the text matches the regex pattern and a keyword is in the name of the field or column that stores the text, or the text is preceded by and within the maximum match distance of a keyword in the same field or cell value. This is true for Microsoft Excel workbooks, CSV files, and TSV files.

  • For structured, record-based data, Macie includes a result if the text matches the regex pattern and the text is within the maximum match distance of a keyword. The keyword can be in the name of an element in the path to the field or array that stores the text, or it can precede and be part of the same value in the field or array that stores the text. This is true for Apache Avro object containers, Apache Parquet files, JSON files, and JSON Lines files.

  • For unstructured data, Macie includes a result if the text matches the regex pattern and the text is preceded by and within the maximum match distance of a keyword. This is true for Adobe Portable Document Format files, Microsoft Word documents, email messages, and non-binary text files other than CSV, JSON, JSON Lines, and TSV files. This includes any structured data, such as tables, in these types of files.

You can specify as many as 50 keywords. Each keyword can contain 3–90 UTF-8 characters. Keywords aren't case sensitive.

Maximum match distance

This is a character-based proximity rule for keywords. Macie uses this setting to determine whether a keyword precedes text that matches the regex pattern. The setting defines the maximum number of characters that can exist between the end of a complete keyword and the end of text that matches the regex pattern. If text matches the regex pattern, occurs after at least one complete keyword, and occurs within the specified distance of the keyword, Macie includes it in the results. Otherwise, Macie excludes it from the results.

You can specify a distance of 1–300 characters. The default distance is 50 characters. For best results, this distance should be greater than the minimum number of characters of text that the regex is designed to detect. If only part of the text is within the maximum match distance of a keyword, Macie doesn’t include it in the results.

Ignore words

These are specific character sequences to exclude from the results. If text matches the regex pattern but it contains an ignore word, Macie doesn't include it in the results.

You can specify as many as 10 ignore words. Each ignore word can contain 4–90 UTF-8 characters. Ignore words are case sensitive.

For example, many companies have a specific syntax for employee IDs. One such syntax might be: a capital letter that indicates whether the employee is a full-time (F) or part-time (P) employee, followed by a hyphen (-), followed by an eight-digit sequence that identifies the employee. Examples are: F-12345678, for a full-time employee, and P-87654321, for a part-time employee.

If you create a custom data identifier to detect employee IDs that use this syntax, you might use the following regex: [A-Z]-\d{8}. To refine the analysis and avoid false positives, you might also configure the custom data identifier to use the keywords employee and employee ID and a maximum match distance of 20 characters. With these criteria, the results include text that matches the regex only if the text occurs after the keyword employee or employee ID and all the text occurs within 20 characters of one of those keywords.
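The employee ID example can be approximated in a few lines of Python. This is a minimal sketch of the documented rules for unstructured text only, not Macie's matching engine: the function names `find_matches` and `keyword_precedes` and the sample text are hypothetical, and Python's `re` module stands in for the PCRE subset that Macie supports.

```python
import re


def keyword_precedes(lowered_text, match, keywords, max_distance):
    """Return True if a complete keyword ends before the match starts and the
    end of the match is within max_distance characters of the keyword's end."""
    for keyword in keywords:
        keyword = keyword.lower()  # keywords aren't case sensitive
        pos = lowered_text.find(keyword)
        while pos != -1:
            keyword_end = pos + len(keyword)
            if keyword_end <= match.start() and match.end() - keyword_end <= max_distance:
                return True
            pos = lowered_text.find(keyword, pos + 1)
    return False


def find_matches(text, regex, keywords=(), max_distance=50, ignore_words=()):
    """Simplified illustration of regex + keywords + ignore words."""
    lowered = text.lower()
    results = []
    for match in re.finditer(regex, text):
        candidate = match.group(0)
        # Ignore words are case sensitive and apply to the matched text.
        if any(word in candidate for word in ignore_words):
            continue
        if keywords and not keyword_precedes(lowered, match, keywords, max_distance):
            continue
        results.append(candidate)
    return results


sample = ("Employee ID: F-12345678 was updated. "
          "Unrelated reference code X-00000000 appears later.")

print(find_matches(sample, r"[A-Z]-\d{8}",
                   keywords=["employee", "employee ID"], max_distance=20))
# prints ['F-12345678']
```

Without the keywords, the same regex also returns X-00000000, which is the kind of false positive that the keywords and the maximum match distance are meant to filter out.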

Defining finding severity settings for custom data identifiers

When you create a custom data identifier, you can also define custom severity settings for sensitive data findings that the identifier produces. By default, Macie assigns the Medium severity to all findings that a custom data identifier produces. If an S3 object contains at least one occurrence of text that matches the detection criteria of a custom data identifier, Macie automatically assigns the Medium severity to the resulting finding.

With custom severity settings, you specify which severity to assign based on the number of occurrences of text that match the detection criteria. You can define occurrences thresholds for as many as three severity levels: Low (least severe), Medium, and High (most severe). An occurrences threshold is the minimum number of matches that must exist in an S3 object to produce a finding with the specified severity. If you specify more than one threshold, the thresholds must be in ascending order by severity, moving from Low to High.

For example, consider severity settings for a custom data identifier that specify three occurrences thresholds, one for each severity level that Macie supports: 1 for Low, 50 for Medium, and 100 for High.

The following table indicates the severity of the findings that the custom data identifier produces.

Occurrences threshold | Severity level | Result
1 | Low | If an S3 object contains 1–49 occurrences of text that match the detection criteria, the severity of the resulting finding is Low.
50 | Medium | If an S3 object contains 50–99 occurrences of text that match the detection criteria, the severity of the resulting finding is Medium.
100 | High | If an S3 object contains 100 or more occurrences of text that match the detection criteria, the severity of the resulting finding is High.

You can also use severity settings to specify whether to create a finding at all. If an S3 object contains fewer occurrences than the lowest occurrences threshold, Macie doesn't create a finding.
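The threshold logic can be sketched as a small Python function. This is an illustration of the rules described in this topic, not Macie's implementation; the function name `resolve_severity` and the dictionary layout are assumptions made for the example.

```python
SEVERITY_ORDER = ["Low", "Medium", "High"]  # least severe to most severe


def resolve_severity(occurrences, thresholds):
    """Return the severity for a number of matches, or None if no finding
    would be created.

    thresholds maps a severity level to the minimum number of matches
    (the occurrences threshold) for that level, for example
    {"Low": 1, "Medium": 50, "High": 100}.
    """
    severity = None
    for level in SEVERITY_ORDER:
        minimum = thresholds.get(level)
        # Keep the most severe level whose threshold has been reached.
        if minimum is not None and occurrences >= minimum:
            severity = level
    return severity


thresholds = {"Low": 1, "Medium": 50, "High": 100}
for count in (0, 1, 49, 50, 99, 100):
    print(count, resolve_severity(count, thresholds))
```

With a single threshold such as `{"High": 100}`, the function returns `None` for any object that has fewer than 100 matches, which corresponds to Macie not creating a finding.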

Creating custom data identifiers

Follow these steps to create a custom data identifier by using the Amazon Macie console. To create a custom data identifier programmatically, use the CreateCustomDataIdentifier operation of the Amazon Macie API.

To create a custom data identifier
  1. Open the Amazon Macie console at https://console.aws.amazon.com/macie/.

  2. In the navigation pane, under Settings, choose Custom data identifiers.

  3. Choose Create.

  4. For Name, enter a name for the custom data identifier. The name can contain as many as 128 characters.

    Avoid including any sensitive data in the name. Other users of your account might be able to see the name, depending on the actions that they're allowed to perform in Macie.

  5. (Optional) For Description, enter a brief description of the custom data identifier. The description can contain as many as 512 characters.

    Avoid including any sensitive data in the description. Other users of your account might be able to see the description, depending on the actions that they're allowed to perform in Macie.

  6. For Regular expression, enter the regular expression (regex) that defines the text pattern to match. The regex can contain as many as 512 characters. To learn about supported syntax and constraints, see Regex support later in this section.

  7. (Optional) For Keywords, enter as many as 50 character sequences (separated by commas) to define specific text that must be in proximity of text that matches the regex pattern. Each keyword can contain 3–90 UTF-8 characters. Keywords aren't case sensitive.

    Macie includes an occurrence in the results only if the text matches the regex pattern and the text is within the maximum match distance of one of these keywords, as explained in the preceding topic.

  8. (Optional) For Ignore words, enter as many as 10 character sequences (separated by commas) that define specific text to exclude from the results. Each ignore word can contain 4–90 UTF-8 characters. Ignore words are case sensitive.

    Macie excludes an occurrence from the results if the text matches the regex pattern but it contains one of these ignore words.

  9. (Optional) For Maximum match distance, enter the maximum number of characters that can exist between the end of a keyword and the end of text that matches the regex pattern. The distance can be 1–300 characters. The default distance is 50 characters.

    Macie includes an occurrence in the results only if the text matches the regex pattern and the text is within this distance of a complete keyword, as explained in the preceding topic.

  10. For Severity, choose how you want Macie to assign severity to sensitive data findings that the custom data identifier produces:

    • To automatically assign the Medium severity to all findings, choose Use Medium severity for any number of matches (default). With this option, Macie automatically assigns the Medium severity to a finding if the affected S3 object contains one or more occurrences of text that match the detection criteria.

    • To assign severity based on occurrences thresholds that you specify, choose Use custom settings to determine severity. Then use the Occurrences threshold and Severity level options to specify the minimum number of matches that must exist in an S3 object to produce a finding with a selected severity.

      For example, to assign the High severity to a finding that reports 100 or more occurrences of text that match the detection criteria, enter 100 in the Occurrences threshold box and then choose High from the Severity level list.

      You can specify as many as three occurrences thresholds, one for each severity level that Macie supports: Low (for least severe), Medium, or High (for most severe). If you specify more than one, the thresholds must be in ascending order by severity, moving from Low to High. If an S3 object contains fewer occurrences than the lowest specified threshold, Macie doesn't create a finding.

  11. (Optional) For Tags, choose Add tag, and then enter as many as 50 tags to assign to the custom data identifier.

    A tag is a label that you define and assign to certain types of AWS resources. Each tag consists of a required tag key and an optional tag value. Tags can help you identify, categorize, and manage resources in different ways, such as by purpose, owner, environment, or other criteria. To learn more, see Tagging Amazon Macie resources.

  12. (Optional) For Evaluate, enter up to 1,000 characters in the Sample data box, and then choose Test to test the detection criteria. Macie evaluates the sample data and reports the number of occurrences of text that match the criteria. You can repeat this step as many times as you like to refine and optimize the criteria.

    Note

    We strongly recommend that you test and refine the detection criteria before you save the custom data identifier. Because custom data identifiers are used by sensitive data discovery jobs, you can't edit a custom data identifier after you save it. This helps ensure that you have an immutable history of sensitive data findings and discovery results for data privacy and protection audits or investigations that you perform.

  13. When you finish, choose Submit.

Macie tests the settings and verifies that it can compile the regex. If there's an issue with any of the settings or the regex, Macie reports an error that indicates the nature of the issue. After you address any issues, you can save the custom data identifier.
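For the programmatic route, the following Python sketch assembles a request for the CreateCustomDataIdentifier operation and checks it against the limits listed in the preceding steps. The helper names `build_create_request` and `create_identifier` are hypothetical. The request keys follow the parameter names of the boto3 `macie2` client's `create_custom_data_identifier` method as I understand them; verify them against the SDK reference before relying on this. Length checks use Python's `len`, which counts code points, so treat them as an approximation of the documented character limits.

```python
def build_create_request(name, regex, keywords=(), ignore_words=(),
                         maximum_match_distance=50, severity_levels=None,
                         description=None):
    """Validate the settings against the documented limits and return
    keyword arguments for create_custom_data_identifier."""
    if not name or len(name) > 128:
        raise ValueError("name must contain 1-128 characters")
    if not regex or len(regex) > 512:
        raise ValueError("regex must contain 1-512 characters")
    if description is not None and len(description) > 512:
        raise ValueError("description can contain as many as 512 characters")
    if len(keywords) > 50 or any(not 3 <= len(k) <= 90 for k in keywords):
        raise ValueError("as many as 50 keywords, 3-90 characters each")
    if len(ignore_words) > 10 or any(not 4 <= len(w) <= 90 for w in ignore_words):
        raise ValueError("as many as 10 ignore words, 4-90 characters each")
    if not 1 <= maximum_match_distance <= 300:
        raise ValueError("maximum match distance must be 1-300 characters")

    request = {"name": name, "regex": regex,
               "maximumMatchDistance": maximum_match_distance}
    if description:
        request["description"] = description
    if keywords:
        request["keywords"] = list(keywords)
    if ignore_words:
        request["ignoreWords"] = list(ignore_words)
    if severity_levels:
        # severity_levels: iterable of (occurrences threshold, severity) pairs
        request["severityLevels"] = [
            {"occurrencesThreshold": threshold, "severity": severity}
            for threshold, severity in severity_levels
        ]
    return request


def create_identifier(request):
    """Send the request. Requires boto3 and AWS credentials; not called here."""
    import boto3
    client = boto3.client("macie2")
    return client.create_custom_data_identifier(**request)["customDataIdentifierId"]


request = build_create_request(
    name="employee-id",
    regex=r"[A-Z]-\d{8}",
    keywords=["employee", "employee ID"],
    maximum_match_distance=20,
    severity_levels=[(1, "LOW"), (50, "MEDIUM"), (100, "HIGH")],
)
print(request)
```

Because a custom data identifier can't be edited after it's saved, validating the settings locally and testing the criteria against sample data before calling `create_identifier` is a reasonable precaution.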

Regex support in custom data identifiers

Macie supports a subset of the regex pattern syntax provided by the Perl Compatible Regular Expressions (PCRE) library. Of the constructs provided by the PCRE library, Macie doesn’t support the following pattern elements:

  • Backreferences

  • Capturing groups

  • Conditional patterns

  • Embedded code

  • Global pattern flags, such as /i, /m, and /x

  • Recursive patterns

  • Positive and negative look-behind and look-ahead zero-width assertions, such as ?=, ?!, ?<=, and ?<!

To create effective regex patterns for custom data identifiers, also note the following tips and recommendations:

  • Anchors – Use anchors (^ or $) only if you expect the pattern to appear at the beginning or end of a file, not the beginning or end of a line.

  • Bounded repeats – For performance reasons, Macie limits the size of bounded repeat groups. For example, \d{100,1000} won’t compile in Macie. To approximate this functionality, you can use an open-ended repeat such as \d{100,}.

  • Case insensitivity – To make parts of a pattern case insensitive, you can use the (?i) construct instead of the /i flag.

  • Performance – There’s no need to optimize prefixes or alternations manually. For example, changing /hello|hi|hey/ to /h(?:ello|i|ey)/ won’t improve performance.

  • Wildcards – For performance reasons, Macie limits the number of repeated wildcards. For example, a*b*a* won’t compile in Macie.

To protect against malformed or long-running expressions, Macie automatically tests regex patterns against a collection of sample text.
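Before you submit a pattern, a rough local check can catch some of the unsupported constructs listed above. The following Python sketch is a heuristic only: the function name `unsupported_constructs` is hypothetical, the checks rely on simple text matching rather than a real regex parser, and they don't cover embedded code, global pattern flags, or the limits on bounded repeats and wildcards. Macie's own validation remains the authority on whether a pattern compiles.

```python
import re

_ESCAPE = re.compile(r"\\.", re.DOTALL)         # any backslash escape, such as \d or \(
_CHAR_CLASS = re.compile(r"\[\^?\]?[^\]]*\]")   # rough match for [...] classes

_GROUP_CHECKS = [
    ("look-ahead or look-behind assertion", re.compile(r"\(\?<?[=!]")),
    ("capturing group", re.compile(r"\((?!\?)|\(\?P?<[A-Za-z_]|\(\?'")),
    ("conditional pattern", re.compile(r"\(\?\(")),
    ("recursive pattern", re.compile(r"\(\?(?:R|[+-]?\d+)\)")),
]


def unsupported_constructs(pattern):
    """Return the names of unsupported constructs that the heuristics detect."""
    found = []
    # Backreferences: \1 through \9, \k<name>, or (?P=name).
    escapes = [m.group(0)[1] for m in _ESCAPE.finditer(pattern)]
    if any(ch in "123456789k" for ch in escapes) or "(?P=" in pattern:
        found.append("backreference")
    # Drop escapes and character classes so that literal parentheses
    # such as \( or [(] are not mistaken for groups.
    stripped = _CHAR_CLASS.sub("", _ESCAPE.sub("", pattern))
    for name, check in _GROUP_CHECKS:
        if check.search(stripped):
            found.append(name)
    return found


for candidate in (r"[A-Z]-\d{8}", r"(?i)employee", r"h(?:ello|i|ey)",
                  r"(\d{3})-\1", r"(?<=ID: )\d+"):
    print(candidate, unsupported_constructs(candidate))
```

An empty list means only that none of these heuristics fired, not that the pattern is guaranteed to compile in Macie.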