Creating a sensitive data discovery job

With Amazon Macie, you create and run sensitive data discovery jobs to automate discovery, logging, and reporting of sensitive data in Amazon Simple Storage Service (Amazon S3) buckets. A sensitive data discovery job analyzes objects in S3 buckets to determine whether the objects contain sensitive data, and it provides detailed reports of the sensitive data that it finds and the analysis that it performs.

When you create a job, you start by specifying which S3 buckets you want the job to analyze—specific buckets that you select or buckets that match specific criteria. Then you specify how often to run the job—once, or periodically on a daily, weekly, or monthly basis. You can also choose various options to refine the scope of the job's analysis. These options include custom criteria that derive from properties of S3 objects, such as last modified date and prefix.

After you define the schedule and scope of the job, you specify which managed data identifiers and custom data identifiers you want the job to use when it analyzes data:

  • A managed data identifier is a set of built-in criteria and techniques that are designed to detect a specific type of sensitive data—for example, credit card numbers, AWS secret access keys, or passport numbers for a particular country or region. These identifiers can detect a large and growing list of sensitive data types for many countries and regions, including multiple types of financial data, personal health information (PHI), and personally identifiable information (PII). For more information, see Using managed data identifiers.

  • A custom data identifier is a set of criteria that you define to detect sensitive data. With custom data identifiers, you can detect sensitive data that reflects your organization's particular scenarios, intellectual property, or proprietary data—for example, employee IDs, customer account numbers, or internal data classifications. These identifiers can supplement the managed data identifiers that Macie provides. For more information, see Building custom data identifiers.

When you finish choosing these options, you're ready to specify a name for the job, and then review and save the job.

Before you begin

Before you create a job, it's a good idea to take the following steps:

  • Verify that you configured Macie to store your sensitive data discovery results in an S3 bucket. To do this, choose Discovery results in the navigation pane on the Amazon Macie console, and then verify that you entered the settings. To learn about these settings, see Storing and retaining sensitive data discovery results.

  • Create any custom data identifiers that you want the job to use. To learn how, see Building custom data identifiers.

  • If you want the job to analyze objects that are encrypted with a customer managed AWS KMS key, ensure that Macie has permission to use the key. For more information, see Analyzing encrypted S3 objects.

  • If you want the job to analyze objects in a bucket that has a restrictive bucket policy, ensure that Macie is allowed to access objects in the bucket. For more information, see Allowing Macie to access S3 buckets and objects.

If you do these things before you create a job, you streamline creation of the job and help ensure that the job analyzes the data that you want.

Step 1: Choose S3 buckets

The first step in creating a job is to specify which S3 buckets you want the job to analyze. For this step, you have two options:

  • Select specific buckets – With this option, you explicitly select each S3 bucket that you want the job to analyze. Then, when the job runs, it analyzes objects only in the buckets that you select.

  • Specify bucket criteria – With this option, you define runtime criteria that determine which S3 buckets the job analyzes. The criteria consist of one or more conditions that derive from bucket properties. Then, when the job runs, it identifies buckets that match your criteria and analyzes objects in those buckets.

For detailed information about these options, see Scope options for sensitive data discovery jobs.

The following sections provide step-by-step instructions for choosing and configuring each option. Choose the section for the option that you want.

Select specific buckets

If you choose to explicitly select each S3 bucket that you want the job to analyze, Macie provides you with a complete inventory of your buckets in the current AWS Region. You can then use this inventory to select one or more buckets for the job to analyze. To learn about this inventory, see Selecting S3 buckets.

If you're the Macie administrator for an organization, the inventory includes buckets that are owned by member accounts in your organization. You can configure the job to analyze objects in as many as 1,000 of these buckets, spanning as many as 1,000 accounts.

To select specific buckets for the job

  1. Open the Macie console at https://console.aws.amazon.com/macie/.

  2. In the navigation pane, choose Jobs.

  3. Choose Create job.

  4. On the Choose S3 buckets page, choose Select specific buckets. Macie displays a table of all the buckets for your account in the current Region.

  5. Under Select S3 buckets, optionally choose refresh (the circular arrow icon) to retrieve the latest bucket metadata from Amazon S3.

    If the information icon (a blue circle that contains a lowercase letter i) appears next to any bucket names, we recommend that you refresh the metadata. This icon indicates that a bucket was created during the past 24 hours, possibly after Macie last retrieved bucket and object metadata from Amazon S3 as part of the daily refresh cycle.

  6. In the table, select the check box for each bucket that you want the job to analyze.

    Tip
    • To find specific buckets more easily, enter filter criteria in the filter bar above the table. You can also sort the table by choosing a column heading.

    • To quickly determine whether you already configured a job to periodically analyze objects in a bucket, refer to the Monitored column. If Yes appears in the column, the bucket is explicitly included in a periodic job or the bucket matched the criteria for a periodic job within the past 24 hours. In addition, the status of at least one of those jobs is not Cancelled. Macie updates this data on a daily basis.

    • To quickly determine when you most recently ran a periodic or one-time job to analyze objects in a bucket, refer to the Latest job run column. For additional information about that job, refer to the bucket's details.

    • To display a bucket's details, choose the bucket's name. In addition to job-related information, the details panel provides statistics and other information about the bucket, such as the bucket's public access settings. To learn more about this data, see Reviewing your S3 bucket inventory.

  7. When you finish selecting buckets, choose Next.

In the next step, you'll review and verify your selections.
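
If you script job creation instead of using the console, the selections from this procedure can be expressed as a request fragment. The following is a minimal sketch, assuming the `bucketDefinitions` shape (an `accountId` plus a list of `buckets`) used by the Macie CreateClassificationJob operation. The account IDs and bucket names are hypothetical, and `build_bucket_definitions` is an illustrative helper, not part of any AWS SDK.

```python
from collections import defaultdict

# Limit stated above for a Macie administrator account. The 1,000-account
# limit is implied, because each account contributes at least one bucket.
MAX_BUCKETS = 1000


def build_bucket_definitions(selections):
    """Group (account_id, bucket_name) pairs into bucketDefinitions entries."""
    by_account = defaultdict(list)
    for account_id, bucket_name in selections:
        # Ignore a bucket that was selected twice for the same account.
        if bucket_name not in by_account[account_id]:
            by_account[account_id].append(bucket_name)

    total = sum(len(names) for names in by_account.values())
    if total > MAX_BUCKETS:
        raise ValueError(f"{total} buckets selected; the limit is {MAX_BUCKETS}")

    return [
        {"accountId": account_id, "buckets": names}
        for account_id, names in by_account.items()
    ]


# Hypothetical selections, as (account ID, bucket name) pairs.
selections = [
    ("111122223333", "amzn-s3-demo-bucket1"),
    ("111122223333", "amzn-s3-demo-bucket2"),
    ("444455556666", "amzn-s3-demo-bucket3"),
]
bucket_definitions = build_bucket_definitions(selections)
print(bucket_definitions)
```

Grouping by account keeps each owner's buckets together, which is how an administrator's selection across member accounts would naturally be organized.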

Specify bucket criteria

If you choose to specify runtime criteria that determine which S3 buckets the job analyzes, Macie provides options to help you choose fields, operators, and values for individual conditions in the criteria. To learn more about these options, see Specifying S3 bucket criteria.

To specify bucket criteria for the job

  1. Open the Macie console at https://console.aws.amazon.com/macie/.

  2. In the navigation pane, choose Jobs.

  3. Choose Create job.

  4. On the Choose S3 buckets page, choose Specify bucket criteria.

  5. Under Specify bucket criteria, do the following to add a condition to the criteria:

    1. Place your cursor in the filter bar, and then choose the bucket property to use for the condition.

    2. In the first field, choose an operator for the condition, Equals or Not equals.

    3. In the next field, enter one or more values for the property.

      Depending on the type and nature of the bucket property, Macie displays different options for entering values. For example, if you choose the Effective permission property, Macie displays a list of values to choose from. If you choose the Account ID property, Macie displays a text box in which you can enter one or more AWS account IDs. To enter multiple values in a text box, enter each value and separate each entry with a comma.

    4. Choose Apply. Macie adds the condition to a filter box below the filter bar.

      By default, Macie adds the condition with an include statement. This means that the job is configured to analyze (include) objects in buckets that match the condition. To skip (exclude) buckets that match the condition, choose Include in the filter box, and then choose Exclude.

    5. Repeat the preceding steps for each additional condition that you want to add to the criteria.

  6. To test your criteria, expand the Preview the criteria results section. This section displays a table of all the buckets that currently match the criteria.

  7. To refine your criteria, do any of the following:

    • To remove a condition, choose X in the filter box for the condition.

    • To change a condition, remove the condition by choosing X in the filter box for the condition. Then add a condition that has the correct settings.

    • To remove all conditions, choose Clear filters.

    Macie updates the table of criteria results to reflect your changes.

  8. When you finish specifying bucket criteria, choose Next.

In the next step, you'll review and verify your criteria.
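
To make the include and exclude behavior concrete, the following sketch builds one include condition and one exclude condition and then previews which buckets would match, much like the Preview the criteria results table. It assumes the `bucketCriteria` shape of the CreateClassificationJob operation (`includes` and `excludes`, each holding `simpleCriterion` entries with `key`, `comparator`, and `values`). The local matcher is an illustration only, not how Macie evaluates criteria. In particular, how multiple conditions combine is an assumption here, so the example uses a single condition in each group. The inventory data is hypothetical.

```python
def condition(key, comparator, values):
    """One bucket condition: a property, an operator, and one or more values."""
    if comparator not in ("EQ", "NE"):
        raise ValueError("this sketch supports Equals (EQ) and Not equals (NE)")
    return {
        "simpleCriterion": {
            "key": key,
            "comparator": comparator,
            "values": list(values),
        }
    }


def matches(bucket, cond):
    """Return True if the bucket's property satisfies the condition."""
    term = cond["simpleCriterion"]
    found = bucket.get(term["key"]) in term["values"]
    return found if term["comparator"] == "EQ" else not found


def job_would_analyze(bucket, criteria):
    """A bucket is analyzed if it is included and not excluded."""
    included = all(matches(bucket, c) for c in criteria["includes"]["and"])
    excluded = any(matches(bucket, c) for c in criteria["excludes"]["and"])
    return included and not excluded


bucket_criteria = {
    "includes": {"and": [condition("S3_BUCKET_EFFECTIVE_PERMISSION", "EQ", ["PUBLIC"])]},
    "excludes": {"and": [condition("ACCOUNT_ID", "EQ", ["444455556666"])]},
}

# Hypothetical inventory, keyed by the same property names as the conditions.
inventory = [
    {"name": "amzn-s3-demo-bucket1", "ACCOUNT_ID": "111122223333",
     "S3_BUCKET_EFFECTIVE_PERMISSION": "PUBLIC"},
    {"name": "amzn-s3-demo-bucket2", "ACCOUNT_ID": "111122223333",
     "S3_BUCKET_EFFECTIVE_PERMISSION": "NOT_PUBLIC"},
    {"name": "amzn-s3-demo-bucket3", "ACCOUNT_ID": "444455556666",
     "S3_BUCKET_EFFECTIVE_PERMISSION": "PUBLIC"},
]
preview = [b["name"] for b in inventory if job_would_analyze(b, bucket_criteria)]
print(preview)
```

Because the criteria are evaluated each time the job runs, the same definition can match a different set of buckets on a later run.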

Step 2: Review your S3 bucket selections or criteria

For this step, verify that you chose the correct settings in the preceding step.

Review your bucket selections

If you selected specific S3 buckets for the job, review the table of buckets and change your bucket selections as necessary. The table provides insight into the projected scope and cost of the job's analysis. The data is based on the size and types of objects that are currently stored in a bucket.

The Estimated cost field indicates the total estimated cost (in US Dollars) of analyzing objects in a bucket. Each estimate reflects the projected amount of uncompressed data that the job will analyze in a bucket. If any objects are compressed or archive files, the estimate assumes that the files use a 3:1 compression ratio and the job can analyze all extracted files. For more information, see Forecasting and monitoring costs for sensitive data discovery jobs.
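
The 3:1 assumption can be worked through with simple arithmetic. The following sketch projects the amount of uncompressed data for a bucket and multiplies it by a price. The price per GB is a placeholder chosen for easy arithmetic, not an actual Macie rate, and the function names are illustrative.

```python
# Ratio that the estimate assumes for compressed and archive files (3:1).
ASSUMED_COMPRESSION_RATIO = 3


def projected_uncompressed_gb(plain_gb, compressed_gb):
    """Data the job is projected to analyze after extraction."""
    return plain_gb + compressed_gb * ASSUMED_COMPRESSION_RATIO


def estimated_cost_usd(plain_gb, compressed_gb, price_per_gb):
    """Projected data volume multiplied by a caller-supplied price."""
    return projected_uncompressed_gb(plain_gb, compressed_gb) * price_per_gb


# A bucket with 40 GB of uncompressed objects and 20 GB of compressed files
# is projected at 40 + (20 x 3) = 100 GB of data to analyze.
projected = projected_uncompressed_gb(plain_gb=40, compressed_gb=20)
cost = estimated_cost_usd(plain_gb=40, compressed_gb=20, price_per_gb=1.00)
print(projected, cost)
```

If the files actually compress better or worse than 3:1, the real volume and cost will differ from the estimate.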

Review your bucket criteria

If you specified bucket criteria for the job, review each condition in the criteria. To change the criteria, choose Previous, and then use the filter settings in the preceding step to enter the correct criteria. When you finish, choose Next.

When you finish reviewing and verifying the settings, choose Next.

Step 3: Define the schedule and refine the scope

For this step, specify how often you want the job to run—once, or periodically on a daily, weekly, or monthly basis. Also choose various options to refine the scope of the job's analysis. To learn about these options, see Scope options for sensitive data discovery jobs.

To define the schedule and refine the scope of the job

  1. On the Refine the scope page, choose how often you want the job to run:

    • To run the job only once, immediately after you finish creating it, choose One-time job.

    • To run the job periodically on a recurring basis, choose Scheduled job. For Update frequency, choose whether to run the job daily, weekly, or monthly. Then use the Include existing objects option to define the scope of the job's first run:

      • Select this check box to analyze all existing objects immediately after you finish creating the job. Each subsequent run analyzes only those objects that are created or changed after the preceding run.

      • Clear this check box to skip analysis of all existing objects. The job's first run analyzes only those objects that are created or changed after you finish creating the job and before the first run starts. Each subsequent run analyzes only those objects that are created or changed after the preceding run.

        Clearing this check box is helpful for cases where you've already analyzed the data and want to continue to analyze it periodically. For example, if you previously used Amazon Macie Classic to classify data and you recently moved to Macie, you might use this option to ensure continued discovery and classification of your data without incurring unnecessary costs or duplicating classification data.

  2. (Optional) To specify the percentage of objects that you want the job to analyze, enter the percentage in the Sampling depth box. If this value is less than 100%, Macie selects the objects to analyze at random, up to the specified percentage, and analyzes all the data in those objects. The default value is 100%.

  3. (Optional) To add specific criteria that determine which S3 objects are included or excluded from the job's analysis, expand the Additional settings section, and then enter the criteria. These criteria consist of individual conditions that derive from properties of objects.

    • To analyze (include) objects that meet a specific condition, enter the condition type and value, and then choose Include.

    • To skip (exclude) objects that meet a specific condition, enter the condition type and value, and then choose Exclude.

    Repeat this step for each include or exclude condition that you want.

    In Macie, exclude conditions take precedence over include conditions. For example, if you include objects that have the .pdf file name extension and exclude objects that are larger than 5 MB, the job analyzes any object that has the .pdf file name extension, unless the object is larger than 5 MB.

  4. When you finish, choose Next.
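
The choices in this step can be sketched in two parts. First, a helper maps the schedule, the Include existing objects option, and the sampling depth to request fields. Second, a local check reproduces the rule that exclude conditions take precedence, using the .pdf and 5 MB example above. The field names (`jobType`, `scheduleFrequency`, `initialRun`, `samplingPercentage`) are assumed from the CreateClassificationJob operation. The helper names and object data are hypothetical, and the check covers only one include and one exclude condition.

```python
FIVE_MB = 5 * 1024 * 1024  # assumes binary megabytes; the text says only "5 MB"


def schedule_fields(frequency, include_existing=True, sampling_percent=100,
                    day_of_week=None, day_of_month=None):
    """Map the schedule and sampling choices to request fields."""
    if not 1 <= sampling_percent <= 100:
        raise ValueError("sampling depth must be between 1 and 100")
    fields = {"samplingPercentage": sampling_percent}
    if frequency == "once":
        fields["jobType"] = "ONE_TIME"
        return fields
    schedules = {
        "daily": {"dailySchedule": {}},
        "weekly": {"weeklySchedule": {"dayOfWeek": day_of_week}},
        "monthly": {"monthlySchedule": {"dayOfMonth": day_of_month}},
    }
    fields["jobType"] = "SCHEDULED"
    fields["scheduleFrequency"] = schedules[frequency]
    # Corresponds to the Include existing objects check box.
    fields["initialRun"] = include_existing
    return fields


def is_pdf(obj):
    return obj["key"].lower().endswith(".pdf")


def is_larger_than_5mb(obj):
    return obj["size"] > FIVE_MB


def should_analyze(obj, include_tests, exclude_tests):
    """Exclude conditions win: a matching object is skipped even if included."""
    if any(test(obj) for test in exclude_tests):
        return False
    return all(test(obj) for test in include_tests)


weekly = schedule_fields("weekly", include_existing=False,
                         sampling_percent=50, day_of_week="MONDAY")

objects = [
    {"key": "reports/summary.pdf", "size": 1 * 1024 * 1024},
    {"key": "reports/scan.pdf", "size": 8 * 1024 * 1024},
    {"key": "notes/readme.txt", "size": 2048},
]
decisions = {
    obj["key"]: should_analyze(obj, [is_pdf], [is_larger_than_5mb])
    for obj in objects
}
print(weekly)
print(decisions)
```

The small PDF is analyzed, the large PDF is skipped because the exclude condition wins, and the text file is skipped because it does not meet the include condition.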

Step 4: Select managed data identifiers

For this step, specify which managed data identifiers you want the job to use when it analyzes S3 objects. You can configure the job to use all, some, or none of the managed data identifiers that Macie provides. To review a detailed list of the managed data identifiers that are currently available, see Using managed data identifiers. We update that list each time we release a new managed data identifier.

If you choose to use only some managed data identifiers, Macie displays a table of the managed data identifiers that are currently available. You can use the table to select each managed data identifier that you want the job to use (include) or not use (exclude), depending on the selection type that you choose for the job. In the table, each managed data identifier's ID describes the type of sensitive data that the managed data identifier detects, for example: USA_PASSPORT_NUMBER for US passport numbers, CREDIT_CARD_SECURITY_CODE for credit card verification codes, and PGP_PRIVATE_KEY for PGP private keys. To find specific identifiers more quickly, you can sort and filter the table by sensitive data category and type.

To select managed data identifiers for the job

  1. On the Select managed data identifiers page, under Selection type, do one of the following to specify which managed data identifiers you want the job to use:

    • To use all managed data identifiers, choose All.

      If you choose this option and you configured the job to run more than once, each run will automatically use new managed data identifiers that we release, in addition to all the managed data identifiers that are currently available.

    • To exclude specific managed data identifiers, choose Exclude. Then, in the table that appears, select the check box for each managed data identifier that you don't want the job to use.

      For example, if you don't want the job to detect and report occurrences of mailing addresses, select the ADDRESS check box. If you do this, the job will use all managed data identifiers except the one that detects mailing addresses.

      If you choose the Exclude option and you configured the job to run more than once, each run will automatically use new managed data identifiers that we release, in addition to all the currently available managed data identifiers that you didn't explicitly exclude from the job.

    • To include only specific managed data identifiers, choose Include. Then, in the table that appears, select the check box for each managed data identifier that you want the job to use.

      For example, if you want the job to only detect and report occurrences of US passport numbers, select the USA_PASSPORT_NUMBER check box. If you do this, the job won't use any managed data identifiers except the one that detects US passport numbers.

    • To exclude all managed data identifiers, choose None.

      If you choose this option, the job won't use any managed data identifiers. In the next step, configure the job to instead use one or more custom data identifiers that you specify.

  2. When you finish, choose Next.
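
The practical difference between the selection types shows up when new managed data identifiers are released. The following sketch models that behavior locally. The selector names mirror the four options above, `effective_identifiers` is an illustrative helper, and `NEW_IDENTIFIER_EXAMPLE` is a hypothetical ID that stands in for a future release.

```python
def effective_identifiers(selector, chosen_ids, available_ids):
    """Return the managed data identifiers that a run would use."""
    available = set(available_ids)
    chosen = set(chosen_ids)
    if selector == "ALL":
        return available
    if selector == "EXCLUDE":
        return available - chosen
    if selector == "INCLUDE":
        return available & chosen
    if selector == "NONE":
        return set()
    raise ValueError(f"unknown selection type: {selector}")


available_now = ["ADDRESS", "USA_PASSPORT_NUMBER", "PGP_PRIVATE_KEY"]
# Hypothetical later state, after one new identifier is released.
available_later = available_now + ["NEW_IDENTIFIER_EXAMPLE"]

# Exclude: a later run picks up the new identifier automatically.
exclude_later = effective_identifiers("EXCLUDE", ["ADDRESS"], available_later)

# Include: a later run still uses only what was explicitly selected.
include_later = effective_identifiers("INCLUDE", ["USA_PASSPORT_NUMBER"], available_later)

print(sorted(exclude_later))
print(sorted(include_later))
```

For a recurring job, All and Exclude therefore widen over time, while Include and None stay fixed.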

Step 5: Select custom data identifiers

For this step, optionally select one or more custom data identifiers that you want the job to use when it analyzes S3 objects. The job will use the selected identifiers in addition to any managed data identifiers that you configured the job to use.

To select custom data identifiers for the job

  1. On the Select custom data identifiers page, select the check box for each custom data identifier that you want the job to use. You can select as many as 30 custom data identifiers.

    Tip

    To test or review the settings for a custom data identifier before you select it, choose the link icon (a box with an arrow) next to the identifier's name. Macie opens a page that displays the identifier's settings. You can also use this page to test the identifier with sample data. To do this, enter up to 1,000 characters of text in the Sample data box, and then choose Submit. Macie evaluates the sample data by using the identifier, and then reports the number of matches.

  2. When you finish selecting custom data identifiers, choose Next.
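
Before selecting an identifier, it can help to rehearse the sample-data test locally. The following sketch assumes that the identifier's core criterion is a regular expression and counts matches in a sample of up to 1,000 characters. The employee ID format is hypothetical, and Python's `re` module may not behave identically to the pattern matching that Macie performs, so treat the count as a rough check only.

```python
import re

MAX_SAMPLE_CHARS = 1000  # sample size limit stated in the tip above
MAX_CUSTOM_IDENTIFIERS = 30  # per-job limit stated in step 1 above


def count_matches(pattern, sample):
    """Count non-overlapping matches of a pattern in a short text sample."""
    if len(sample) > MAX_SAMPLE_CHARS:
        raise ValueError(f"sample exceeds {MAX_SAMPLE_CHARS} characters")
    return len(re.findall(pattern, sample))


def check_selection(identifier_ids):
    """Reject a selection that exceeds the per-job limit."""
    if len(identifier_ids) > MAX_CUSTOM_IDENTIFIERS:
        raise ValueError(f"select at most {MAX_CUSTOM_IDENTIFIERS} identifiers")
    return list(identifier_ids)


# Hypothetical format: the letter E, a hyphen, and exactly six digits.
EMPLOYEE_ID_PATTERN = r"\bE-\d{6}\b"

sample = "Badge E-123456 was reissued. E-654321 is pending. Ticket E-12 is unrelated."
match_count = count_matches(EMPLOYEE_ID_PATTERN, sample)
print(match_count)
```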

Step 6: Enter a name and description

For this step, specify a name and, optionally, a brief description of the job.

To enter a name and description for the job

  1. On the Enter a name and description page, enter a name for the job in the Job name box. The name can contain as many as 500 characters.

  2. (Optional) For Job description, enter a brief description of the job. The description can contain as many as 200 characters.

  3. When you finish, choose Next.
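
When job names are generated by a script, it is easy to exceed these limits without noticing. The following is a small validation sketch that uses the 500-character and 200-character limits stated above. The helper name and the sample values are hypothetical.

```python
MAX_NAME_CHARS = 500
MAX_DESCRIPTION_CHARS = 200


def validate_job_text(name, description=""):
    """Check the job name and optional description against the stated limits."""
    if not name:
        raise ValueError("a job name is required")
    if len(name) > MAX_NAME_CHARS:
        raise ValueError(f"name exceeds {MAX_NAME_CHARS} characters")
    if len(description) > MAX_DESCRIPTION_CHARS:
        raise ValueError(f"description exceeds {MAX_DESCRIPTION_CHARS} characters")
    return {"name": name, "description": description}


job_text = validate_job_text(
    "weekly-public-bucket-scan",
    "Weekly scan of publicly accessible buckets.",
)
print(job_text)
```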

Step 7: Review and create

For this final step, review the configuration settings for the job and verify that they're correct. This is an important step. After you create a job, you can’t change any of its settings. This helps ensure that you have an immutable history of sensitive data findings and discovery results for data privacy and protection audits or investigations that you perform.

Depending on the job's settings, you can also review the total estimated cost (in US Dollars) of running the job once. If you selected specific S3 buckets for the job, the estimate is based on the size and types of objects in the buckets that you selected, and how much of that data the job can analyze. If you specified bucket criteria for the job, the estimate is based on the size and types of objects in as many as 500 buckets that currently match the criteria, and how much of that data the job can analyze. To learn about this estimate, see Forecasting and monitoring costs for sensitive data discovery jobs.

To review and create the job

  1. On the Review and create page, review each setting and verify that it's correct. To change a setting, choose Edit in the section that contains the setting, and then enter the correct setting. You can also use the navigation tabs to go to the page that contains a setting.

  2. When you finish verifying the settings, choose Submit to create and save the job. Macie checks the settings and notifies you of any issues to address.

    Note

    If you haven’t configured a repository for your sensitive data discovery results, Macie displays a warning and doesn't save the job. To address this issue, choose Configure in the Repository for sensitive data discovery results section. Then enter the configuration settings for the repository. To learn how, see Storing and retaining sensitive data discovery results. After you enter the settings, return to the Review and create page and refresh the Repository for sensitive data discovery results section of the page.

    Although we don't recommend it, you can temporarily override the repository requirement and save the job. If you do this, you risk losing discovery results from the job—Macie will retain the results for only 90 days. To temporarily override the requirement, select the check box for the override option.

  3. If Macie notifies you of issues to address, address the issues, and then choose Submit again to create and save the job.

If you configured the job to run once, on a daily basis, or on the current day of the week or month, Macie starts running the job immediately after you save it. Otherwise, Macie prepares to run the job on the specified day of the week or month. To monitor the job, you can check the status of the job.
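
The rule for when the first run starts can be sketched as a small function. It assumes the schedule is expressed with the `dailySchedule`, `weeklySchedule`, and `monthlySchedule` shapes of the CreateClassificationJob operation, and it takes the current date as a parameter. Which time zone defines "the current day" is not covered here, so the caller is assumed to supply the right date.

```python
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
             "FRIDAY", "SATURDAY", "SUNDAY"]


def starts_immediately(job_type, schedule, today):
    """Return True if the job runs as soon as it is saved."""
    if job_type == "ONE_TIME":
        return True
    if "dailySchedule" in schedule:
        return True
    if "weeklySchedule" in schedule:
        return schedule["weeklySchedule"]["dayOfWeek"] == DAY_NAMES[today.weekday()]
    if "monthlySchedule" in schedule:
        return schedule["monthlySchedule"]["dayOfMonth"] == today.day
    raise ValueError("unrecognized schedule")


today = date(2024, 1, 1)  # a Monday, and the first day of the month

runs_now = {
    "one-time": starts_immediately("ONE_TIME", {}, today),
    "daily": starts_immediately("SCHEDULED", {"dailySchedule": {}}, today),
    "weekly on Monday": starts_immediately(
        "SCHEDULED", {"weeklySchedule": {"dayOfWeek": "MONDAY"}}, today),
    "weekly on Friday": starts_immediately(
        "SCHEDULED", {"weeklySchedule": {"dayOfWeek": "FRIDAY"}}, today),
    "monthly on the 15th": starts_immediately(
        "SCHEDULED", {"monthlySchedule": {"dayOfMonth": 15}}, today),
}
print(runs_now)
```

In this example, the weekly Friday job and the monthly job on the 15th wait for their scheduled day, and the other jobs start as soon as they are saved.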