How automated sensitive data discovery works
When you enable Amazon Macie for your AWS account, Macie creates an AWS Identity and Access Management (IAM) service-linked role for your account in the current AWS Region. The permissions policy for this role allows Macie to call other AWS services and monitor AWS resources on your behalf. By using this role, Macie generates and maintains a complete inventory of your Amazon Simple Storage Service (Amazon S3) buckets in the Region. The inventory includes information about each of your S3 buckets and the objects in the buckets. If you're the Macie administrator for an organization, the inventory includes information about S3 buckets that your member accounts own. For more information, see Managing multiple accounts.
If automated sensitive data discovery is enabled for your Macie account, Macie evaluates the inventory data on a daily basis to identify S3 objects that are eligible for automated discovery. As part of the evaluation, Macie also selects a sampling of representative objects to analyze. Macie then retrieves and analyzes the latest version of each selected object from Amazon S3, inspecting each object for sensitive data.
As the analysis progresses, Macie updates statistics, inventory data, and other information that it provides about your Amazon S3 data. Macie also produces records of the sensitive data it finds and the analysis that it performs. The resulting data provides insight into where Macie found sensitive data in your Amazon S3 data estate, spanning all the S3 buckets that Macie monitors and analyzes for your account. The data can help you assess the security and privacy of your sensitive data, determine where to perform a deeper investigation, and identify cases where remediation is necessary.
To configure and use automated sensitive data discovery, your account must be a standalone Macie account or the Macie administrator account for an organization.
Key components
Amazon Macie uses a combination of features and techniques to perform automated sensitive data discovery for your Amazon S3 data. These work together with features and techniques that Macie uses to help you monitor your Amazon S3 data for security and access control.
- Selecting S3 objects to analyze
-
On a daily basis, Macie evaluates your Amazon S3 inventory data to identify S3 objects that are eligible for analysis by automated sensitive data discovery. If you're the Macie administrator for an organization, this includes inventory data for S3 buckets that your member accounts own.
As part of the evaluation, Macie uses sampling techniques to select representative objects to analyze. The techniques define groups of objects that have similar metadata and are likely to have similar content. The groups are based on dimensions such as bucket name, prefix, storage class, file name extension, and last modified date. Macie then selects a representative set of samples from each group, retrieves the latest version of each selected object from Amazon S3, and analyzes each selected object to determine whether the object contains sensitive data. When the analysis is complete, Macie discards its copy of the object.
The sampling strategy prioritizes distributed analyses. In general, it uses a breadth-first approach to your Amazon S3 data estate. Each day, Macie selects a representative set of S3 objects from as many of your buckets as possible, based on the total storage size of all the classifiable objects in your Amazon S3 data estate. For example, if Macie has already analyzed and found sensitive data in objects in one S3 bucket and hasn't yet analyzed objects in another bucket, the latter bucket is a higher priority for analysis. With this approach, you gain broad insight into the sensitivity of your Amazon S3 data more quickly. Depending on the size of your data estate, analysis results can begin to appear within 48 hours of enabling automated sensitive data discovery for your account.
The sampling strategy also prioritizes analysis of different kinds of S3 objects and objects that were recently created or changed. No single object sample is guaranteed to be conclusive. Therefore, analysis of a diverse set of objects can yield better insight into the types and amount of sensitive data that an S3 bucket might contain. In addition, prioritizing new or recently changed objects helps the analysis adapt to changes to your bucket inventory. For example, if objects are created or changed after a previous analysis, those objects are a higher priority for subsequent analysis. Conversely, if an object was previously analyzed and hasn't changed since that analysis, Macie doesn't analyze the object again. This approach helps you establish sensitivity baselines for individual S3 buckets. Then, as continual, incremental analyses progress for your account, your sensitivity assessments of individual buckets can become progressively deeper and more detailed at a predictable rate.
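The grouping and sampling idea described above can be illustrated with a minimal Python sketch. It is not Macie's actual algorithm, which isn't specified here. The `S3Object` record, the month-level grouping of last modified dates, the `per_group` parameter, and the `analyzed` map used to skip unchanged objects are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
import os


@dataclass(frozen=True)
class S3Object:
    """Hypothetical inventory record; field names are illustrative only."""
    bucket: str
    key: str
    storage_class: str
    last_modified: datetime


def group_key(obj):
    """Group objects that share metadata and so are likely to share content."""
    prefix = obj.key.rsplit("/", 1)[0] if "/" in obj.key else ""
    extension = os.path.splitext(obj.key)[1].lower()
    # Assumption: last modified dates are grouped by calendar month.
    month = obj.last_modified.strftime("%Y-%m")
    return (obj.bucket, prefix, obj.storage_class, extension, month)


def select_samples(objects, analyzed, per_group=1):
    """Pick up to `per_group` objects from each group, newest first.

    `analyzed` maps (bucket, key) to the last_modified value recorded at the
    previous analysis. Objects that haven't changed since then are skipped.
    """
    groups = defaultdict(list)
    for obj in objects:
        if analyzed.get((obj.bucket, obj.key)) == obj.last_modified:
            continue  # previously analyzed and unchanged
        groups[group_key(obj)].append(obj)

    samples = []
    for members in groups.values():
        members.sort(key=lambda o: o.last_modified, reverse=True)
        samples.extend(members[:per_group])
    return samples
```

Taking a small number of samples from every group, rather than many from one, mirrors the breadth-first emphasis: each bucket and object type gets some coverage before any one of them gets deep coverage.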
- Defining the scope of the analyses
-
By default, Macie includes all the S3 buckets that it monitors and analyzes for your account when it evaluates your inventory data and selects S3 objects to analyze. If you're the Macie administrator for an organization, this includes S3 buckets that your member accounts own.
You can exclude specific S3 buckets from the analyses. For example, you might prefer to exclude buckets that typically store AWS logging data, such as AWS CloudTrail event logs. To exclude a bucket, you can change the automated sensitive data discovery settings for your account or the bucket. If you do this, Macie starts excluding the bucket when the next daily evaluation and analysis cycle starts. You can exclude as many as 1,000 buckets from the analyses.
If you exclude a bucket, you can subsequently include it again. To do this, change the automated sensitive data discovery settings for your account or the bucket again. Macie then starts including the bucket when the next daily evaluation and analysis cycle starts.
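As a hedged illustration of changing the exclusion list programmatically, the following Python sketch builds the arguments for the `update_classification_scope` operation that the AWS SDK for Python (boto3) exposes for Macie. The helper name `build_exclusion_request` and the client-side limit check are assumptions for illustration; the 1,000-bucket limit comes from the text above.

```python
MAX_EXCLUDED_BUCKETS = 1000  # limit stated in the text above


def build_exclusion_request(scope_id, current_excluded, to_exclude):
    """Return keyword arguments for an update_classification_scope call.

    Hypothetical helper: merges the buckets that are already excluded with
    the new ones and replaces the whole list in a single request.
    """
    merged = sorted(set(current_excluded) | set(to_exclude))
    if len(merged) > MAX_EXCLUDED_BUCKETS:
        raise ValueError(
            f"{len(merged)} buckets exceeds the limit of {MAX_EXCLUDED_BUCKETS}"
        )
    return {
        "id": scope_id,
        "s3": {"excludes": {"bucketNames": merged, "operation": "REPLACE"}},
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("macie2").update_classification_scope(**request)`, where the scope ID is assumed to come from the `classificationScopeId` field of `get_automated_discovery_configuration()`. The change would then take effect at the start of the next daily cycle, as described above.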
- Determining which types of sensitive data to detect and report
-
By default, Macie inspects S3 objects by using the set of managed data identifiers that we recommend for automated sensitive data discovery. For a list of these managed data identifiers, see Default settings for automated sensitive data discovery.
You can tailor the analyses to focus on specific types of sensitive data. To do this, change the automated sensitive data discovery settings for your account in any of the following ways:
-
Add or remove specific managed data identifiers – A managed data identifier is a set of built-in criteria and techniques that are designed to detect a specific type of sensitive data, such as bank account numbers, AWS secret access keys, or passport numbers for a particular country or region. For more information, see Using managed data identifiers.
-
Add or subsequently remove custom data identifiers – A custom data identifier is a set of criteria that you define to detect sensitive data. With custom data identifiers, you can detect sensitive data that reflects your organization's particular scenarios, intellectual property, or proprietary data, such as employee IDs, customer account numbers, or internal data classifications. For more information, see Building custom data identifiers.
-
Add or subsequently remove allow lists – In Macie, an allow list specifies text or a text pattern that you want Macie to ignore in S3 objects, typically sensitive data exceptions for your particular scenarios or environment, such as public names or phone numbers for your organization, or sample data that your organization uses for testing. For more information, see Defining sensitive data exceptions with allow lists.
If you change the settings, Macie applies your changes when the next daily analysis cycle starts.
You can also adjust bucket-level settings that determine whether specific types of sensitive data are included in assessments of a bucket's sensitivity. To learn how, see Managing automated sensitive data discovery for individual S3 buckets.
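To make the relationship between a custom data identifier and an allow list concrete, here is a small conceptual Python sketch. It does not use Macie or reproduce how Macie evaluates identifiers. The employee ID format, the `EMPLOYEE_ID` pattern, and the `ALLOW_LIST` values are hypothetical.

```python
import re

# Hypothetical custom data identifier: employee IDs such as "EMP-123456".
EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")

# Hypothetical allow list: sample values that the organization uses for testing.
ALLOW_LIST = {"EMP-000000", "EMP-999999"}


def detect(text, pattern=EMPLOYEE_ID, allow_list=ALLOW_LIST):
    """Return matches of `pattern` in `text`, ignoring allow-listed values."""
    return [match for match in pattern.findall(text) if match not in allow_list]
```

For example, `detect("ids: EMP-123456, EMP-000000")` reports only `EMP-123456`, because the second value is on the allow list.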
- Calculating sensitivity scores
-
By default, Macie automatically calculates a sensitivity score for each S3 bucket that it monitors and analyzes for your account. If you're the Macie administrator for an organization, this includes S3 buckets that your member accounts own.
In Macie, a sensitivity score is a quantitative measure of the intersection of two primary dimensions: the amount of sensitive data that Macie has found in a bucket, and the amount of data that Macie has analyzed in a bucket. A bucket's sensitivity score determines which sensitivity label Macie assigns to the bucket. A sensitivity label is a qualitative representation of a bucket's sensitivity score, for example, Sensitive, Not sensitive, or Not yet analyzed. For details about the range of sensitivity scores and labels that Macie defines, see Sensitivity scoring for S3 buckets.
Important: An S3 bucket's sensitivity score and label don't imply or otherwise indicate the criticality or importance that the bucket or the bucket's objects might have for your organization. Instead, they're intended to provide reference points that can help you identify and monitor potential security risks.
When you initially enable automated sensitive data discovery for your account, Macie automatically assigns a sensitivity score of 50 to each S3 bucket and applies the Not yet analyzed label to each bucket. The exception is empty buckets. A bucket is considered empty if it doesn't contain any objects or if all of its objects contain zero (0) bytes of data. Macie assigns a score of 1 to an empty bucket and applies the Not sensitive label to it.
As automated discovery progresses for your account, Macie updates the score and corresponding label to reflect the results of the analyses. For example:
-
If Macie finds sensitive data in a bucket's objects, Macie increases the bucket's sensitivity score and updates the bucket's sensitivity label as necessary.
-
If Macie doesn't find sensitive data in a bucket's objects, Macie decreases the bucket's sensitivity score and updates the bucket's sensitivity label as necessary.
-
If Macie finds sensitive data in an object that's subsequently deleted, Macie removes sensitive data detections for that object from the bucket's sensitivity score and updates the bucket's sensitivity label as necessary.
You can adjust the sensitivity scoring settings for individual S3 buckets by including or excluding specific types of sensitive data from a bucket's score. You can also override a bucket's calculated score by manually assigning the maximum score (100) to the bucket. If you assign the maximum score, the bucket is labeled Sensitive. For more information, see Managing automated discovery for individual S3 buckets.
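The initial scoring rules and the score-to-label relationship can be sketched in Python as follows. The values 50, 1, and 100 come from the text above. The assumption that scores from 1 through 49 map to Not sensitive and scores from 51 through 100 map to Sensitive should be checked against Sensitivity scoring for S3 buckets. Both function names are hypothetical.

```python
def initial_score(object_sizes):
    """Score assigned when automated discovery is first enabled.

    `object_sizes` is a list of object sizes in bytes for one bucket.
    An empty bucket (no objects, or only zero-byte objects) gets 1;
    any other bucket starts at 50.
    """
    if not object_sizes or all(size == 0 for size in object_sizes):
        return 1
    return 50


def sensitivity_label(score):
    """Map a score to a label, under the assumed ranges noted above."""
    if score == 50:
        return "Not yet analyzed"
    if 1 <= score <= 49:
        return "Not sensitive"
    if 51 <= score <= 100:
        return "Sensitive"
    raise ValueError(f"score {score} is outside the ranges modeled here")
```

Under these assumptions, manually assigning the maximum score of 100 always yields the Sensitive label, which matches the override behavior described above.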
- Generating metadata, statistics, and results
-
When automated sensitive data discovery is enabled for your account, Macie automatically generates and maintains additional inventory data, statistics, and other information about the S3 buckets that it monitors and analyzes for your account. If you're the Macie administrator for an organization, this includes buckets that your member accounts own.
The additional information captures the results of the automated sensitive data discovery activities that Macie has performed thus far for your account. It also supplements other information that Macie provides about your Amazon S3 data, such as the public access and shared access settings for individual buckets. The additional information includes:
-
Aggregated data sensitivity statistics, such as the total number of buckets that Macie has found sensitive data in and how many of those buckets are publicly accessible.
-
An interactive, visual representation of data sensitivity across your Amazon S3 data estate.
-
Bucket-level details that indicate the current status of the analyses, such as a list of the objects that Macie has analyzed in a bucket, the types of sensitive data that Macie has found in a bucket, and the number of occurrences of each type of sensitive data that Macie found.
For more information, see Reviewing automated sensitive data discovery statistics and results.
Macie automatically recalculates and updates this information while performing automated sensitive data discovery for your account. For example, if Macie finds sensitive data in an object that's subsequently deleted, Macie updates the applicable bucket's metadata: it removes the object from the list of analyzed objects, removes occurrences of sensitive data that it found in the object, recalculates the sensitivity score if the score is calculated automatically, and updates the sensitivity label as necessary to reflect the new score.
In addition to metadata and statistics, Macie produces records of the sensitive data it finds and the analysis that it performs: sensitive data findings, which report sensitive data that Macie finds in individual S3 objects, and sensitive data discovery results, which log details about the analysis of individual S3 objects.
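The bookkeeping described above for a deleted object can be pictured with a small in-memory model. This is only a conceptual Python sketch: `BucketProfile` and its methods are hypothetical, the detection type names are placeholders, and score recalculation is omitted because its formula isn't given here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BucketProfile:
    """Hypothetical stand-in for a bucket's discovery metadata."""
    # Maps object key -> Counter of sensitive data occurrences by type.
    analyzed: dict = field(default_factory=dict)

    def record_analysis(self, key, detections):
        """Store the detections found in one analyzed object."""
        self.analyzed[key] = Counter(detections)

    def remove_object(self, key):
        """Drop a deleted object and, with it, the detections it contributed."""
        self.analyzed.pop(key, None)

    def occurrences(self):
        """Total occurrences of each sensitive data type across the bucket."""
        total = Counter()
        for detections in self.analyzed.values():
            total += detections
        return total
```

Because totals are derived from the per-object records, removing an object automatically removes its occurrences from the bucket-level counts.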
Considerations
As you use Amazon Macie to perform automated sensitive data discovery for your Amazon S3 data, keep the following in mind:
-
Your automated discovery settings apply only to the current AWS Region. In addition, the resulting analyses and data apply only to S3 buckets and objects in the current Region. To perform automated discovery and access the resulting data in additional Regions, enable and configure automated discovery in each additional Region.
-
If you're the Macie administrator for an organization:
-
You can perform automated discovery for a member account only if Macie is enabled for the account in the current Region. Member accounts can't perform automated discovery for their own accounts.
-
Member accounts can't access automated discovery settings that apply to their S3 buckets. Only the Macie administrator can access these settings.
-
Member accounts can't access sensitive data discovery statistics and other results that Macie directly provides for their S3 buckets. For example, a member account can't use the Amazon Macie console to review sensitivity scores for its S3 buckets. Only the Macie administrator can access this data.
-
If an S3 bucket's permissions settings prevent Macie from retrieving information about or accessing the bucket or the bucket’s objects, Macie can't perform automated discovery for the bucket.
For these cases, Macie can provide only a subset of information about the bucket, such as the account ID for the AWS account that owns the bucket, the bucket's name and Region, and the date and time when Macie most recently retrieved both bucket and object metadata for the bucket as part of the daily refresh cycle. In your bucket inventory, the sensitivity score for these buckets is 50 and their sensitivity label is Not yet analyzed. To investigate the issue, review the bucket’s policy and permissions settings in Amazon S3. For example, the bucket might have a restrictive bucket policy. For more information, see Allowing Macie to access S3 buckets and objects.
-
To be eligible for selection and analysis, an S3 object must be classifiable. A classifiable object uses a supported Amazon S3 storage class and it has a file name extension for a supported file or storage format. For more information, see Supported storage classes and formats.
-
Macie can analyze an encrypted S3 object only if the object is encrypted with a key that Macie is allowed to use. For more information, see Analyzing encrypted S3 objects.
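Because automated discovery settings apply only to the current Region, as the first consideration above notes, enabling the feature in several Regions means repeating the change once per Region. The following Python sketch shows one way that loop might look. The function name and the `client_factory` parameter are hypothetical; the sketch assumes each client exposes the `update_automated_discovery_configuration` operation that boto3 provides for Macie.

```python
def enable_in_regions(regions, client_factory):
    """Enable automated discovery in each Region and report the outcome.

    `client_factory` is a callable that returns a Macie client for a
    Region name. It is injected so the loop can be exercised without
    AWS access.
    """
    results = {}
    for region in regions:
        client = client_factory(region)
        try:
            client.update_automated_discovery_configuration(status="ENABLED")
            results[region] = "ENABLED"
        except Exception as error:  # for example, Macie not enabled there
            results[region] = f"FAILED: {error}"
    return results
```

With boto3 installed and credentials configured, `client_factory` could be `lambda region: boto3.client("macie2", region_name=region)`. Recording failures per Region instead of stopping at the first one keeps a single misconfigured Region from blocking the rest.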