Locating sensitive data with Amazon Macie findings

When you run a sensitive data discovery job, Amazon Macie captures details about the location of each occurrence of sensitive data that it finds in an Amazon S3 object. This includes sensitive data that Macie detects using managed data identifiers, and data that matches any custom data identifiers that you configure a sensitive data discovery job to use.

With sensitive data findings, you can view these details for as many as 15 occurrences of sensitive data that Macie detects when it runs a job. The details provide insight into the breadth of the categories and types of sensitive data that specific S3 buckets and objects contain. They can also help you locate individual occurrences of sensitive data and determine whether to perform a deeper investigation of specific buckets and objects.

To help you locate an occurrence of sensitive data, a finding can provide details such as:

  • The column and row number for a cell or field in a Microsoft Excel workbook, CSV file, or TSV file.

  • The path to a field or array in a JSON or JSON Lines file.

  • The line number for a line in a non-binary text file other than a CSV, JSON, JSON Lines, or TSV file—for example, an HTML, TXT, or XML file.

  • The page number for a page in an Adobe Portable Document Format (PDF) file.

  • The record index and the path to a field in a record in an Apache Avro object container or Apache Parquet file.

You can access these details by using the Amazon Macie console or the Amazon Macie API. You can also access them in findings that Macie publishes to other AWS services, namely Amazon EventBridge and AWS Security Hub.

If an S3 object contains many occurrences of sensitive data, you can also use a finding to navigate to the corresponding sensitive data discovery result for the finding. Unlike a sensitive data finding, a sensitive data discovery result provides detailed location data for as many as 1,000 occurrences of each type of sensitive data that Macie detects in an object. If an S3 object is an archive file, such as a .tar or .zip file, a sensitive data discovery result also provides detailed location data for occurrences of sensitive data in individual files that Macie extracts from the archive file. (Macie doesn’t include this information in sensitive data findings.) For more information about sensitive data discovery results, see Reviewing job statistics and results. Macie uses the same schema for location data in sensitive data findings and sensitive data discovery results.

The topics in this section explain how to locate occurrences of sensitive data by using sensitive data findings and the Amazon Macie console. They also explain the schema that Macie uses to store and report the location of individual occurrences of sensitive data. To access location data programmatically, you can use the Findings Descriptions resource of the Amazon Macie API. To learn how to access the data in findings that Macie publishes to other AWS services, see Monitoring and processing findings.
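As a rough illustration of programmatic access, the following sketch pulls the location-related parts out of a single finding. It assumes a client that behaves like the `macie2` client in the AWS SDK for Python (Boto3), whose `get_findings` operation returns a `findings` list. The nested field names (`classificationDetails`, `result`, `detailedResultsLocation`) are assumptions based on the findings schema, and the helper name `location_data_for` is hypothetical. Verify the field names against the Amazon Macie API Reference before relying on them.

```python
from typing import Any


def location_data_for(client: Any, finding_id: str) -> dict:
    """Return the location-related parts of one sensitive data finding.

    `client` is expected to behave like boto3.client("macie2").
    The nested field names used below are assumptions; check them
    against the API Reference.
    """
    response = client.get_findings(findingIds=[finding_id])
    findings = response.get("findings", [])
    if not findings:
        raise LookupError(f"No finding returned for ID {finding_id}")

    details = findings[0].get("classificationDetails") or {}
    result = details.get("result") or {}
    return {
        # Detections from managed data identifiers.
        "sensitiveData": result.get("sensitiveData") or [],
        # Detections from custom data identifiers.
        "customDataIdentifiers": result.get("customDataIdentifiers") or {},
        # Pointer to the corresponding sensitive data discovery result.
        "detailedResultsLocation": details.get("detailedResultsLocation"),
    }
```

With valid credentials, a call might look like `location_data_for(boto3.client("macie2"), finding_id)`. Passing the client in as a parameter keeps the helper easy to exercise with a stub.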

Locating occurrences of sensitive data

When you run a sensitive data discovery job, Macie performs a deep inspection of the latest version of each S3 object that you configure the job to analyze. Macie also uses a depth-first search algorithm to populate the job's findings with details about the location of 1–15 occurrences of the sensitive data that Macie detects. These occurrences provide insight into the categories and types of sensitive data that the affected S3 buckets and objects contain. You can use these details to locate individual occurrences of sensitive data and determine whether to perform a deeper investigation of specific buckets and objects.

To locate occurrences of sensitive data

  1. Open the Macie console at https://console.aws.amazon.com/macie/.

  2. In the navigation pane, choose Findings.

    Tip

    You can also use the Jobs page to display all the findings from a particular job. To do this, choose Jobs in the navigation pane, and then choose the name of the job. At the top of the details panel, choose Show results, and then choose Show findings.

  3. On the Findings page, choose the finding for the sensitive data that you want to locate. The details panel displays information for the finding.

  4. In the details panel, scroll to the Details section. This section provides information about the categories and types of sensitive data that Macie found in the affected S3 object.

    If the finding includes details about where Macie found a type of sensitive data, an Occurrences field appears and summarizes those details, as shown in the following image.

    
    [Image: The finding details panel with three Occurrences fields. Each field contains a link that shows the number of occurrences that the finding provides location details for.]

    To show the details for a specific type of sensitive data, choose the link in the Occurrences field. Macie opens a new window and displays the details in JSON format. To then save the details as a JSON file, choose Download and specify a name and location for the file.

  5. (Optional) To save all the finding's details as a JSON file, choose the finding's identifier (Finding ID) at the top of the details panel. Macie opens a new window and displays all the details in JSON format. Choose Download, and then specify a name and location for the file.

To access details about the location of as many as 1,000 occurrences of each type of sensitive data in an affected object, refer to the corresponding sensitive data discovery result for the finding. The details panel provides a link to this result: scroll to the Details section of the panel, and then choose the link in the Detailed result location field. Macie opens the Amazon S3 console and displays the file or folder that contains the discovery result. To learn more about these results, see Reviewing job statistics and results.

JSON schema for sensitive data locations

Macie uses standardized JSON structures to store information about where it finds sensitive data in S3 objects. These structures are used by sensitive data findings and sensitive data discovery results. For sensitive data findings, the structures are part of the JSON schema for Macie findings. To view the complete JSON schema for Macie findings, see Findings Descriptions in the Amazon Macie API Reference.

The JSON schema for a sensitive data finding includes one customDataIdentifiers object and one sensitiveData object. The customDataIdentifiers object provides details about data that Macie detected using custom data identifiers. The sensitiveData object provides details about sensitive data that Macie detected using managed data identifiers.

Each customDataIdentifiers and sensitiveData object contains one or more detections arrays:

  • In a customDataIdentifiers object, the detections array indicates the custom data identifiers that detected the data and produced the finding. For each custom data identifier, the array also indicates the number of occurrences of the data that the identifier detected. It can also indicate the location of the data that the identifier detected.

  • In a sensitiveData object, a detections array indicates the types of sensitive data that Macie detected using managed data identifiers. For each type of sensitive data, the array also indicates the number of occurrences of the data, and it can indicate the location of the data.

For a sensitive data finding, a detections array can include 1–15 occurrences objects. Each occurrences object specifies where Macie found individual occurrences of a specific type of sensitive data.

For example, the following detections array indicates the location of three occurrences of sensitive data (US Social Security numbers) in a CSV file.

    "sensitiveData": [
        {
            "category": "PERSONAL_INFORMATION",
            "detections": [
                {
                    "count": 30,
                    "occurrences": {
                        "cells": [
                            {
                                "cellReference": null,
                                "column": 1,
                                "columnName": "SSN",
                                "row": 2
                            },
                            {
                                "cellReference": null,
                                "column": 1,
                                "columnName": "SSN",
                                "row": 3
                            },
                            {
                                "cellReference": null,
                                "column": 1,
                                "columnName": "SSN",
                                "row": 4
                            }
                        ]
                    },
                    "type": "USA_SOCIAL_SECURITY_NUMBER"
                }
            ]
        }
    ]

The location and number of occurrences objects in a detections array vary based on the categories, types, and number of occurrences of sensitive data that Macie detects when it runs a sensitive data discovery job. This variation occurs because Macie includes location data for only 1–15 occurrences of the sensitive data that it detects when it runs a job. These 1–15 occurrences are indicative of the categories and types of sensitive data that the affected S3 buckets and objects contain.
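Because a finding reports a total count for each type of sensitive data but location details for only some occurrences, it can help to compare the two. The following sketch walks a sensitiveData array shaped like the CSV example in this topic and reports, for each detection, the total count and the number of occurrences that have location data. The function name `summarize_detections` is hypothetical, and the sample data is the CSV example written as a Python literal.

```python
def summarize_detections(sensitive_data):
    """Return (category, type, total count, located occurrences) tuples."""
    rows = []
    for group in sensitive_data:
        for detection in group.get("detections") or []:
            occurrences = detection.get("occurrences") or {}
            # Count entries across whichever location arrays are present;
            # arrays that are null (None) contribute nothing.
            located = sum(
                len(entries)
                for entries in occurrences.values()
                if isinstance(entries, list)
            )
            rows.append(
                (group.get("category"), detection.get("type"),
                 detection.get("count", 0), located)
            )
    return rows


# The CSV example from this topic, written as a Python literal.
example = [
    {
        "category": "PERSONAL_INFORMATION",
        "detections": [
            {
                "count": 30,
                "occurrences": {
                    "cells": [
                        {"cellReference": None, "column": 1, "columnName": "SSN", "row": 2},
                        {"cellReference": None, "column": 1, "columnName": "SSN", "row": 3},
                        {"cellReference": None, "column": 1, "columnName": "SSN", "row": 4},
                    ]
                },
                "type": "USA_SOCIAL_SECURITY_NUMBER",
            }
        ],
    }
]

print(summarize_detections(example))
# [('PERSONAL_INFORMATION', 'USA_SOCIAL_SECURITY_NUMBER', 30, 3)]
```

A large gap between the total count and the located occurrences, as in this sample (30 versus 3), is one signal that the corresponding sensitive data discovery result is worth reviewing.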

An occurrences object can contain any of the following structures, depending on an S3 object's file type or storage format:

  • cells array – This array applies to Microsoft Excel workbooks, CSV files, and TSV files. An object in this array specifies a cell or field that contains an occurrence of sensitive data.

  • lineRanges array – This array applies to non-binary text files other than CSV, JSON, JSON Lines, and TSV files—for example, HTML, TXT, and XML files. An object in this array specifies a line or an inclusive range of lines that contains an occurrence of sensitive data, and the position of the data on the specified line or lines.

    In certain cases, an object in a lineRanges array specifies the location of sensitive data in a file type or storage format that's supported by another type of array. Those cases are: sensitive data in an unstructured section of an otherwise structured file, such as a comment in a file; sensitive data in a malformed file that Macie analyzes as plaintext; and sensitive data in one or more column names of a CSV or TSV file.

  • offsetRanges array – This array is reserved for future use. If this array is present, the value for it is always null.

  • pages array – This array applies to Adobe Portable Document Format (PDF) files. An object in this array specifies a page that contains an occurrence of sensitive data.

  • records array – This array applies to Apache Avro object containers, Apache Parquet files, JSON files, and JSON Lines files. For Avro object containers and Parquet files, an object in this array specifies a record index and the path to a field in a record that contains an occurrence of sensitive data. For JSON and JSON Lines files, an object in this array specifies the path to a field or array that contains an occurrence of sensitive data. For JSON Lines files, it also specifies the index of the line that contains the data.

The contents of these arrays vary based on an affected S3 object's file type or storage format and its contents. The next topic provides details and examples of each array.
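Code that consumes an occurrences object generally needs to handle whichever of these arrays is present and tolerate the ones that are null. The following sketch shows one way to do that. It uses only field names that appear in this topic; the function name `describe_locations` and the wording of the output strings are illustrative choices, not part of the Macie schema.

```python
from typing import Iterator


def describe_locations(occurrences: dict) -> Iterator[str]:
    """Yield one human-readable line per located occurrence."""
    # `or []` treats both a missing array and a null array as empty.
    for cell in occurrences.get("cells") or []:
        ref = cell.get("cellReference")
        where = ref if ref else f"row {cell['row']}, column {cell['column']}"
        yield f"cell: {where}"
    for rng in occurrences.get("lineRanges") or []:
        yield (f"lines {rng['start']}-{rng['end']}, "
               f"starting at column {rng['startColumn']}")
    for page in occurrences.get("pages") or []:
        yield f"page {page['pageNumber']}"
    for record in occurrences.get("records") or []:
        # The jsonPath field can be absent from a Record object.
        path = record.get("jsonPath") or "<path omitted>"
        yield f"record index {record['recordIndex']}, path {path}"
    # offsetRanges is reserved for future use, so it is ignored here.
```

For example, `list(describe_locations({"pages": [{"pageNumber": 10}], "offsetRanges": None}))` returns `['page 10']`.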

JSON details and examples for sensitive data locations

Macie tailors the contents of the JSON structures that it uses to indicate the location of sensitive data in specific types of files and content. The following topics explain and provide examples of these structures.

For a complete list of JSON structures that can be included in a sensitive data finding, see Findings Descriptions in the Amazon Macie API Reference.

Cells array

Applies to: Microsoft Excel workbooks, CSV files, and TSV files

In a cells array, a Cell object specifies a cell or field that contains an occurrence of sensitive data. The following table describes the purpose of each field in a Cell object.

| Field | Type | Description |
|---|---|---|
| cellReference | String | The location of the cell, as an absolute cell reference, that contains the sensitive data. This field applies only to Excel workbooks. This value is null for CSV and TSV files. |
| column | Integer | The column number of the column that contains the sensitive data. For an Excel workbook, this value correlates to the alphabetical character(s) for a column identifier (for example, 1 for column A, 2 for column B, and so on). |
| columnName | String | The name of the column that contains the sensitive data, if available. |
| row | Integer | The row number of the row that contains the sensitive data. |

The following example shows the structure of a Cell object that reports an occurrence of sensitive data in a CSV file.

    "cells": [
        {
            "cellReference": null,
            "column": 3,
            "columnName": "SSN",
            "row": 5
        }
    ]

In the preceding example, the finding indicates that the field in the fifth row of the third column (named SSN) of the file contains sensitive data.

The following example shows the structure of a Cell object that reports an occurrence of sensitive data in an Excel workbook.

    "cells": [
        {
            "cellReference": "Sheet2!C5",
            "column": 3,
            "columnName": "SSN",
            "row": 5
        }
    ]

In the preceding example, the finding indicates that the worksheet named Sheet2 in the workbook contains sensitive data. In that worksheet, the sensitive data is in the cell in the fifth row of the third column (column C, named SSN).
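The mapping from a column number to a spreadsheet column identifier (1 for column A, 2 for column B, and so on) can be sketched as follows. This is a generic conversion rather than code from Macie, and the helper names `column_letter` and `a1_notation` are hypothetical. It can be useful when you open a CSV or TSV file in a spreadsheet application, because the cellReference field is null for those files.

```python
def column_letter(column: int) -> str:
    """Convert a 1-based column number to letters (1 -> A, 27 -> AA)."""
    if column < 1:
        raise ValueError("column numbers start at 1")
    letters = ""
    while column > 0:
        # Shift to 0-based before dividing so that 26 maps to Z, not A0.
        column, remainder = divmod(column - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def a1_notation(cell: dict) -> str:
    """Build a spreadsheet-style reference from a Cell object's column and row."""
    return f"{column_letter(cell['column'])}{cell['row']}"


print(a1_notation({"cellReference": None, "column": 3, "columnName": "SSN", "row": 5}))
# C5
```

The result matches the cell portion of the Excel example in this topic, where column 3 and row 5 correspond to the reference Sheet2!C5.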

LineRanges array

Applies to: Non-binary text files other than CSV, JSON, JSON Lines, and TSV files—for example, HTML, TXT, and XML files

In a lineRanges array, a Range object specifies a line or an inclusive range of lines that contains an occurrence of sensitive data, and the position of the data on the specified line or lines.

This object is often empty for file types that are supported by other types of arrays in occurrences objects. Exceptions are:

  • Data in unstructured sections of an otherwise structured file, such as a comment in a file.

  • Data in a malformed file that Macie analyzes as plaintext.

  • A CSV or TSV file that has one or more column names that contain sensitive data.

The following table describes the purpose of each field in a Range object of a lineRanges array.

| Field | Type | Description |
|---|---|---|
| end | Integer | The number of lines from the beginning of the file to the end of the sensitive data. |
| start | Integer | The number of lines from the beginning of the file to the beginning of the sensitive data. |
| startColumn | Integer | The number of characters, with spaces and starting from 1, from the beginning of the first line that contains the sensitive data (start) to the beginning of the sensitive data. |

The following example shows the structure of a Range object that reports an occurrence of sensitive data that's stored on a single line in a TXT file.

    "lineRanges": [
        {
            "end": 1,
            "start": 1,
            "startColumn": 119
        }
    ]

In the preceding example, the finding indicates that the first line of the file contains a complete occurrence of sensitive data (a mailing address). The first character in the occurrence is 119 characters (with spaces) from the beginning of that line.

The following example shows the structure of a Range object that reports an occurrence of sensitive data that spans multiple lines in a TXT file.

    "lineRanges": [
        {
            "end": 54,
            "start": 51,
            "startColumn": 1
        }
    ]

In the preceding example, the finding indicates that lines 51 through 54 of the file contain an occurrence of sensitive data (a mailing address). The first character in the occurrence is the first character on line 51 of the file.
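To pull the reported lines out of a local copy of a text file, a Range object can be applied roughly as follows. This sketch assumes that start and end are 1-based line numbers, as the examples in this topic suggest, and that startColumn is a 1-based character position on the first line. A Range object doesn't indicate where the occurrence ends on its last line, so the result includes the remainder of that line. The function name `extract_range` is hypothetical.

```python
def extract_range(text: str, line_range: dict) -> str:
    """Return the text covered by a Range object, from startColumn onward."""
    lines = text.splitlines()
    # Convert the 1-based, inclusive line range to a Python slice.
    selected = lines[line_range["start"] - 1:line_range["end"]]
    if not selected:
        return ""
    # startColumn is 1-based and applies only to the first selected line.
    selected[0] = selected[0][line_range["startColumn"] - 1:]
    return "\n".join(selected)


sample = "id: 7\nname: x\naddr: 1 Main St\nSpringfield\n"
print(extract_range(sample, {"start": 3, "end": 4, "startColumn": 7}))
# 1 Main St
# Springfield
```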

Pages array

Applies to: Adobe Portable Document Format (PDF) files

In a pages array, a Page object specifies a page that contains an occurrence of sensitive data. The object contains a pageNumber field. The pageNumber field stores an integer that specifies the page number of the page that contains the sensitive data.

The following example shows the structure of a Page object that reports an occurrence of sensitive data in a PDF file.

    "pages": [
        {
            "pageNumber": 10
        }
    ]

In the preceding example, the finding indicates that page 10 of the file contains sensitive data.

Records array

Applies to: Apache Avro object containers, Apache Parquet files, JSON files, and JSON Lines files

For an Avro object container or a Parquet file, a Record object in a records array specifies a record index and the path to a field in a record that contains an occurrence of sensitive data. For JSON and JSON Lines files, a Record object specifies the path to a field or array that contains an occurrence of sensitive data. For JSON Lines files, it also specifies the index of the line that contains the data.

The following table describes the purpose of each field in a Record object.

| Field | Type | Description |
|---|---|---|
| jsonPath | String | The path, as a JSONPath expression, to the sensitive data. For an Avro object container or a Parquet file, this is the path to the field in the record (recordIndex) that contains the data. For a JSON or JSON Lines file, this is the path to the field or array that contains the data. If the data is a value in an array, the path also indicates which value contains the data. If Macie detects sensitive data in the name of any element in the path, Macie omits the jsonPath field from a Record object. If the name of a path element exceeds 20 characters, Macie truncates the name by removing characters from the beginning of the name. If the resulting full path exceeds 250 characters, Macie also truncates the path, starting with the first element in the path, until the path contains 250 or fewer characters. |
| recordIndex | Integer | For an Avro object container or a Parquet file, the record index, starting from 0, for the record that contains the sensitive data. For a JSON Lines file, the line index, starting from 0, for the line that contains the sensitive data. This value is always 0 for JSON files. |

The following example shows the structure of a Record object that reports an occurrence of sensitive data in a Parquet file. In this example, Macie truncated the name of the field that contains the data, specified in the jsonPath field, to meet the character limit.

    "records": [
        {
            "jsonPath": "$['…hijklmnopqrstuvwxyz']",
            "recordIndex": 7663
        }
    ]

In the preceding example, the finding indicates that the record of index 7663 (record number 7664) contains sensitive data. In that record, the sensitive data is in the field whose name ends with hijklmnopqrstuvwxyz. The full JSON path to the field in the record is $.abcdefghijklmnopqrstuvwxyz.

The following example also shows the structure of a Record object that reports an occurrence of sensitive data in a Parquet file. In this example, Macie truncated both the full path and the name of the field that contains the data.

    "records": [
        {
            "jsonPath": "$..usssn2.usssn3.usssn4.usssn5.usssn6.usssnfield7.usssn8.usssn9.usssn10.usssn11.usssn12.usssn13.usssn14.usssn15.usssn16.usssn17.usssn18.usssn19.usssn20.usssn21.usssn22.usssn23.usssn24.usssn25.usssn26.usssn27.usssn28.usssn29['…hijklmnopqrstuvwxyz']",
            "recordIndex": 2335
        }
    ]

In the preceding example, the finding indicates that the record of index 2335 (record number 2336) contains sensitive data. In that record, the sensitive data is in the field whose name ends with hijklmnopqrstuvwxyz. The full JSON path to the field in the record is: $['1234567890']usssn1.usssn2.usssn3.usssn4.usssn5.usssn6.usssnfield7.usssn8.usssn9.usssn10.usssn11.usssn12.usssn13.usssn14.usssn15.usssn16.usssn17.usssn18.usssn19.usssn20.usssn21.usssn22.usssn23.usssn24.usssn25.usssn26.usssn27.usssn28.usssn29['abcdefghijklmnopqrstuvwxyz']

The following example shows the structure of a Record object that reports an occurrence of sensitive data in a JSON file. In this example, the sensitive data is a specific value in an array.

    "records": [
        {
            "jsonPath": "$.access.key[2]",
            "recordIndex": 0
        }
    ]

In the preceding example, the finding indicates that the value at index 2 of an array named key contains sensitive data. The array is a child of an object named access.

The following example shows the structure of a Record object that reports an occurrence of sensitive data in a JSON Lines file.

    "records": [
        {
            "jsonPath": "$.access.key",
            "recordIndex": 3
        }
    ]

In the preceding example, the finding indicates that the line at index 3 (the fourth line, because the line index starts from 0) of the file contains sensitive data. In that line, the sensitive data is in a field named key, which is a child of an object named access.
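To look up the reported field in a local copy of a JSON Lines file, a Record object can be applied roughly as follows. This sketch treats recordIndex as a 0-based line index, as the description of the recordIndex field states, and it assumes that every physical line in the file counts toward that index. It handles only simple dotted paths such as `$.access.key`. Paths that are truncated, that use bracket notation, or that include array positions are rejected rather than guessed at. The function name `resolve_record` is hypothetical.

```python
import json
import re

# Accept only simple dotted paths, for example "$.access.key".
_SIMPLE_PATH = re.compile(r"^\$(\.[A-Za-z_][A-Za-z0-9_]*)+$")


def resolve_record(jsonl_text: str, record: dict):
    """Return the value that a Record object points to in JSON Lines text."""
    path = record.get("jsonPath")
    if not path or not _SIMPLE_PATH.match(path):
        raise ValueError(f"unsupported or missing jsonPath: {path!r}")
    # Assumption: recordIndex counts physical lines, starting from 0.
    lines = jsonl_text.splitlines()
    node = json.loads(lines[record["recordIndex"]])
    # Drop the leading "$." and descend one field name at a time.
    for name in path[2:].split("."):
        node = node[name]
    return node


jsonl = "\n".join([
    '{"access": {"key": "a"}}',
    '{"access": {"key": "b"}}',
    '{"access": {"key": "c"}}',
    '{"access": {"key": "d"}}',
])
print(resolve_record(jsonl, {"jsonPath": "$.access.key", "recordIndex": 3}))
# d
```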