Locating sensitive data with Amazon Macie findings
When you run a sensitive data discovery job, Amazon Macie captures details about the location of each occurrence of sensitive data that it finds in an Amazon S3 object. This includes sensitive data that Macie detects using managed data identifiers, and data that matches the criteria of any custom data identifiers that you configure the job to use.
With sensitive data findings, you can view these details for as many as 15 occurrences of sensitive data that Macie detects when it runs a job. The details provide insight into the breadth of the categories and types of sensitive data that specific S3 buckets and objects contain. They can also help you locate individual occurrences of sensitive data and determine whether to perform a deeper investigation of specific buckets and objects.
To help you locate an occurrence of sensitive data, a finding can provide details such as:
- The column and row number for a cell or field in a Microsoft Excel workbook, CSV file, or TSV file.
- The path to a field or array in a JSON or JSON Lines file.
- The line number for a line in a non-binary text file other than a CSV, JSON, JSON Lines, or TSV file, for example, an HTML, TXT, or XML file.
- The page number for a page in an Adobe Portable Document Format (PDF) file.
- The record index and the path to a field in a record in an Apache Avro object container or Apache Parquet file.
You can access these details by using the Amazon Macie console or the Amazon Macie API. You can also access them in findings that Macie publishes to other AWS services, namely Amazon EventBridge and AWS Security Hub.
If an S3 object contains many occurrences of sensitive data, you can also use a finding to navigate to the corresponding sensitive data discovery result for the finding. Unlike a sensitive data finding, a sensitive data discovery result provides detailed location data for as many as 1,000 occurrences of each type of sensitive data that Macie detects in an object. If an S3 object is an archive file, such as a .tar or .zip file, a sensitive data discovery result also provides detailed location data for occurrences of sensitive data in individual files that Macie extracts from the archive file. (Macie doesn’t include this information in sensitive data findings.) For more information about sensitive data discovery results, see Reviewing job statistics and results. Macie uses the same schema for location data in sensitive data findings and sensitive data discovery results.
The topics in this section explain how to locate occurrences of sensitive data by using sensitive data findings and the Amazon Macie console. They also explain the schema that Macie uses to store and report the location of individual occurrences of sensitive data. To access location data programmatically, you can use the GetFindings operation of the Amazon Macie API. To learn how to access the data in findings that Macie publishes to other AWS services, see Monitoring and processing findings.
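As a rough illustration of programmatic access, the following Python sketch walks the location data in a finding returned by GetFindings. It assumes the boto3 `macie2` client and the `classificationDetails.result` layout of the finding schema. The helper names `iter_detections` and `fetch_findings` are hypothetical and exist only for this example.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_detections(
    finding: Dict[str, Any],
) -> Iterator[Tuple[str, str, int, Dict[str, Any]]]:
    """Yield (source, type_or_name, count, occurrences) for each detection."""
    result = (finding.get("classificationDetails") or {}).get("result") or {}

    # Managed data identifiers: one entry per category, each with detections.
    for group in result.get("sensitiveData") or []:
        for det in group.get("detections") or []:
            yield ("managed", det.get("type", ""), det.get("count", 0),
                   det.get("occurrences") or {})

    # Custom data identifiers: a single object with its own detections array.
    custom = result.get("customDataIdentifiers") or {}
    for det in custom.get("detections") or []:
        yield ("custom", det.get("name", ""), det.get("count", 0),
               det.get("occurrences") or {})


def fetch_findings(finding_ids: List[str]) -> List[Dict[str, Any]]:
    """Call GetFindings through boto3 (needs AWS credentials and Macie access)."""
    import boto3  # imported lazily so the parsing helper works without boto3

    client = boto3.client("macie2")
    return client.get_findings(findingIds=finding_ids)["findings"]
```

Keeping the parsing helper separate from the API call means the same code can also process findings that were saved as JSON files from the console.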
Locating occurrences of sensitive data
When you run a sensitive data discovery job, Macie performs a deep inspection of the latest version of each S3 object that you configure the job to analyze. Macie also uses a depth-first search algorithm to populate the job's findings with details about the location of 1–15 occurrences of sensitive data that Macie finds. These occurrences provide insight into the categories and types of sensitive data that the affected S3 buckets and objects contain. You can use these details to locate individual occurrences of sensitive data and determine whether to perform a deeper investigation of specific buckets and objects.
To locate occurrences of sensitive data
- Open the Amazon Macie console at https://console.aws.amazon.com/macie/.
- In the navigation pane, choose Findings.
  Tip: You can use the Jobs page to display all the findings from a particular job. To do this, choose Jobs in the navigation pane, and then choose the name of the job. At the top of the details panel, choose Show results, and then choose Show findings.
- On the Findings page, choose the finding for the sensitive data that you want to locate. The details panel displays information for the finding.
- In the details panel, scroll to the Details section. This section provides information about the categories and types of sensitive data that Macie found in the affected S3 object. It also indicates the number of occurrences of each type of sensitive data that Macie found.
  For example, the following image shows some details of a finding that reports 30 occurrences of credit card numbers, 30 occurrences of names, and 30 occurrences of US Social Security numbers.
  If the finding includes details about the location of one or more occurrences of a specific type of sensitive data, the number of occurrences is a link. Choose the link to show the details. Macie opens a new window and displays the details in JSON format.
  For example, the following image shows the location of two occurrences of credit card numbers in an affected object.
  To save the details as a JSON file, choose Download, and then specify a name and location for the file.
- (Optional) To save all the finding's details as a JSON file, choose the finding's identifier (Finding ID) at the top of the details panel. Macie opens a new window and displays all the details in JSON format. Choose Download, and then specify a name and location for the file.
To access details about the location of as many as 1,000 occurrences of each type of sensitive data in an affected object, you can refer to the corresponding sensitive data discovery result for the finding. To do this, scroll to the beginning of the Details section of the panel, and then choose the link in the Detailed result location field. Macie opens the Amazon S3 console and displays the file or folder that contains the discovery result. To learn more about these results, see Reviewing job statistics and results.
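If you work with a finding's JSON rather than the console, the same location can be read from the finding. The following sketch assumes that the value is stored in the `detailedResultsLocation` field of `classificationDetails` and that it is an `s3://` URI. Both are assumptions to verify against your own findings, and the helper names are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def discovery_result_location(finding: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (bucket, key_or_prefix) for a finding's discovery result, if any."""
    details = finding.get("classificationDetails") or {}
    uri = details.get("detailedResultsLocation")  # assumed field name
    return split_s3_uri(uri) if uri else None
```

Because the console can open either a file or a folder, treat the second element of the returned pair as a key that might also be a prefix.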
JSON schema for sensitive data locations
Macie uses standardized JSON structures to store information about where it finds sensitive data in S3 objects. These structures are used by sensitive data findings and sensitive data discovery results. For sensitive data findings, the structures are part of the JSON schema for Macie findings. To view the complete JSON schema for Macie findings, see Findings in the Amazon Macie API Reference.
The JSON schema for a sensitive data finding includes one customDataIdentifiers object and one sensitiveData object. The customDataIdentifiers object provides details about data that Macie detected using custom data identifiers. The sensitiveData object provides details about sensitive data that Macie detected using managed data identifiers.
Each customDataIdentifiers and sensitiveData object contains one or more detections arrays:
- In a customDataIdentifiers object, the detections array indicates which custom data identifiers detected the data and produced the finding. For each custom data identifier, the array also indicates the number of occurrences of the data that the identifier detected. It can also indicate the location of the data that the identifier detected.
- In a sensitiveData object, a detections array indicates the types of sensitive data that Macie detected using managed data identifiers. For each type of sensitive data, the array also indicates the number of occurrences of the data, and it can indicate the location of the data.
For a sensitive data finding, a detections array can include 1–15 occurrences objects. Each occurrences object specifies where Macie found individual occurrences of a specific type of sensitive data.
For example, the following detections array indicates the location of three occurrences of sensitive data (US Social Security numbers) in a CSV file.

    "sensitiveData": [
      {
        "category": "PERSONAL_INFORMATION",
        "detections": [
          {
            "count": 30,
            "occurrences": {
              "cells": [
                {
                  "cellReference": null,
                  "column": 1,
                  "columnName": "SSN",
                  "row": 2
                },
                {
                  "cellReference": null,
                  "column": 1,
                  "columnName": "SSN",
                  "row": 3
                },
                {
                  "cellReference": null,
                  "column": 1,
                  "columnName": "SSN",
                  "row": 4
                }
              ]
            },
            "type": "USA_SOCIAL_SECURITY_NUMBER"
          }
        ]
      }
    ]

The location and number of occurrences objects in a detections array vary based on the categories, types, and number of occurrences of sensitive data that Macie detects when it runs a sensitive data discovery job. This variation occurs because Macie includes location data for only 1–15 occurrences of the sensitive data that it detects when it runs a job. These 1–15 occurrences are indicative of the categories and types of sensitive data that the affected S3 buckets and objects contain.
An occurrences object can contain any of the following structures, depending on an S3 object's file type or storage format:
- cells array – This array applies to Microsoft Excel workbooks, CSV files, and TSV files. An object in this array specifies a cell or field that contains an occurrence of sensitive data.
- lineRanges array – This array applies to non-binary text files other than CSV, JSON, JSON Lines, and TSV files, for example, HTML, TXT, and XML files. An object in this array specifies a line or an inclusive range of lines that contains an occurrence of sensitive data, and the position of the data on the specified line or lines.
  In certain cases, an object in a lineRanges array specifies the location of sensitive data in a file type or storage format that's supported by another type of array. Those cases are: sensitive data in an unstructured section of an otherwise structured file, such as a comment in a file; sensitive data in a malformed file that Macie analyzes as plaintext; and sensitive data in one or more column names of a CSV or TSV file.
- offsetRanges array – This array is reserved for future use. If this array is present, the value for it is always null.
- pages array – This array applies to Adobe Portable Document Format (PDF) files. An object in this array specifies a page that contains an occurrence of sensitive data.
- records array – This array applies to Apache Avro object containers, Apache Parquet files, JSON files, and JSON Lines files. For Avro object containers and Parquet files, an object in this array specifies a record index and the path to a field in a record that contains an occurrence of sensitive data. For JSON and JSON Lines files, an object in this array specifies the path to a field or array that contains an occurrence of sensitive data. For JSON Lines files, it also specifies the index of the line that contains the data.
The contents of these arrays vary based on an affected S3 object's file type or storage format and its contents. The next topic provides details and examples of each array.
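One way to handle these structures uniformly is to check each array in turn and skip the ones that are absent or null. The following sketch turns an occurrences object into short, readable descriptions. The function name and the output wording are illustrative choices, not part of the Macie schema.

```python
from typing import Any, Dict, List


def describe_locations(occurrences: Dict[str, Any]) -> List[str]:
    """Summarize each location in an occurrences object as a short string."""
    out: List[str] = []
    for cell in occurrences.get("cells") or []:
        out.append(f"cell: row {cell['row']}, column {cell['column']}")
    for rng in occurrences.get("lineRanges") or []:
        out.append(
            f"lines {rng['start']}-{rng['end']}, "
            f"starting at character {rng['startColumn']}"
        )
    for page in occurrences.get("pages") or []:
        out.append(f"page {page['pageNumber']}")
    for rec in occurrences.get("records") or []:
        # jsonPath can be missing, so read it defensively.
        out.append(f"record {rec['recordIndex']}, path {rec.get('jsonPath')}")
    # offsetRanges is reserved for future use and always null, so it is ignored.
    return out
```

Using `or []` for every array keeps the function safe for findings in which an array is present but set to null.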
JSON details and examples for sensitive data locations
Macie tailors the contents of the JSON structures that it uses to indicate the location of sensitive data in specific types of files and content. The following topics explain and provide examples of these structures.
For a complete list of JSON structures that can be included in a sensitive data finding, see Findings in the Amazon Macie API Reference.
Cells array
Applies to: Microsoft Excel workbooks, CSV files, and TSV files
In a cells array, a Cell object specifies a cell or field that contains an occurrence of sensitive data. The following table describes the purpose of each field in a Cell object.
Field | Type | Description |
---|---|---|
cellReference | String | The location of the cell, as an absolute cell reference, that contains the sensitive data. This field applies only to Excel workbooks. This value is null for CSV and TSV files. |
column | Integer | The column number of the column that contains the sensitive data. For an Excel workbook, this value correlates to the alphabetical character(s) for a column identifier, for example, 1 for column A, 2 for column B, and so on. |
columnName | String | The name of the column that contains the sensitive data, if available. |
row | Integer | The row number of the row that contains the sensitive data. |
The following example shows the structure of a Cell object that reports an occurrence of sensitive data in a CSV file.

    "cells": [
      {
        "cellReference": null,
        "column": 3,
        "columnName": "SSN",
        "row": 5
      }
    ]

In the preceding example, the finding indicates that the field in the fifth row of the third column (named SSN) of the file contains sensitive data.
The following example shows the structure of a Cell object that reports an occurrence of sensitive data in an Excel workbook.

    "cells": [
      {
        "cellReference": "Sheet2!C5",
        "column": 3,
        "columnName": "SSN",
        "row": 5
      }
    ]

In the preceding example, the finding indicates that the worksheet named Sheet2 in the workbook contains sensitive data. In that worksheet, the sensitive data is in the cell in the fifth row of the third column (column C, named SSN).
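To relate the numeric column value to a spreadsheet's column letters, you can convert it with a small helper. The following sketch assumes the mapping described in the table (1 for column A, 2 for column B) continues in the usual spreadsheet fashion (27 for AA). The function names are hypothetical.

```python
from typing import Any, Dict


def column_letter(column: int) -> str:
    """Convert a 1-based column number to spreadsheet letters (1 -> A, 27 -> AA)."""
    if column < 1:
        raise ValueError("column numbers start at 1")
    letters = ""
    while column > 0:
        column, remainder = divmod(column - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def describe_cell(cell: Dict[str, Any]) -> str:
    """Prefer the Excel cell reference; otherwise describe the row and column."""
    reference = cell.get("cellReference")
    if reference:
        return reference
    name = cell.get("columnName")
    label = f" ({name})" if name else ""
    return f"row {cell['row']}, column {cell['column']}{label}"
```

For the two preceding examples, `describe_cell` returns `row 5, column 3 (SSN)` for the CSV file and `Sheet2!C5` for the Excel workbook, and `column_letter(3)` returns `C`.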
LineRanges array
Applies to: Non-binary text files other than CSV, JSON, JSON Lines, and TSV files—for example, HTML, TXT, and XML files
In a lineRanges array, a Range object specifies a line or an inclusive range of lines that contains an occurrence of sensitive data, and the position of the data on the specified line or lines.
This object is often empty for file types that are supported by other types of arrays in occurrences objects. Exceptions are:
- Data in unstructured sections of an otherwise structured file, such as a comment in a file.
- Data in a malformed file that Macie analyzes as plaintext.
- A CSV or TSV file that has one or more column names that contain sensitive data.
The following table describes the purpose of each field in a Range object of a lineRanges array.
Field | Type | Description |
---|---|---|
end | Integer | The number of lines from the beginning of the file to the end of the sensitive data. |
start | Integer | The number of lines from the beginning of the file to the beginning of the sensitive data. |
startColumn | Integer | The number of characters, with spaces and starting from 1, from the beginning of the first line that contains the sensitive data (start) to the beginning of the sensitive data. |
The following example shows the structure of a Range object that reports an occurrence of sensitive data that's stored on a single line in a TXT file.

    "lineRanges": [
      {
        "end": 1,
        "start": 1,
        "startColumn": 119
      }
    ]

In the preceding example, the finding indicates that the first line of the file contains a complete occurrence of sensitive data (a mailing address). The first character in the occurrence is 119 characters (with spaces) from the beginning of that line.
The following example shows the structure of a Range object that reports an occurrence of sensitive data that spans multiple lines in a TXT file.

    "lineRanges": [
      {
        "end": 54,
        "start": 51,
        "startColumn": 1
      }
    ]

In the preceding example, the finding indicates that lines 51 through 54 of the file contain an occurrence of sensitive data (a mailing address). The first character in the occurrence is the first character on line 51 of the file.
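Given a local copy of the text file, a Range object can be used to pull out the lines in question. The following sketch assumes that start and end are 1-based line numbers, as the examples suggest, and that startColumn is a 1-based character position on the first line. A Range object has no end column, so the result runs to the end of the last line. The function name is hypothetical.

```python
from typing import Dict


def extract_range(text: str, rng: Dict[str, int]) -> str:
    """Return the text from (start, startColumn) through the end of line `end`."""
    lines = text.splitlines()
    selected = lines[rng["start"] - 1 : rng["end"]]  # inclusive range of lines
    if not selected:
        return ""
    # Drop the characters that precede the sensitive data on the first line.
    selected[0] = selected[0][rng["startColumn"] - 1 :]
    return "\n".join(selected)
```

Treat the output as a superset of the occurrence: it can include trailing text on the last line that isn't part of the sensitive data.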
Pages array
Applies to: Adobe Portable Document Format (PDF) files
In a pages array, a Page object specifies a page that contains an occurrence of sensitive data. The object contains a pageNumber field. The pageNumber field stores an integer that specifies the page number of the page that contains the sensitive data.
The following example shows the structure of a Page object that reports an occurrence of sensitive data in a PDF file.

    "pages": [
      {
        "pageNumber": 10
      }
    ]

In the preceding example, the finding indicates that page 10 of the file contains sensitive data.
Records array
Applies to: Apache Avro object containers, Apache Parquet files, JSON files, and JSON Lines files
For an Avro object container or a Parquet file, a Record object in a records array specifies a record index and the path to a field in a record that contains an occurrence of sensitive data. For JSON and JSON Lines files, a Record object specifies the path to a field or array that contains an occurrence of sensitive data. For JSON Lines files, it also specifies the index of the line that contains the data.
The following table describes the purpose of each field in a Record object.
Field | Type | Description |
---|---|---|
jsonPath | String | The path, as a JSONPath expression, to the sensitive data. For an Avro object container or a Parquet file, this is the path to the field in the record (recordIndex) that contains the sensitive data. For a JSON or JSON Lines file, this is the path to the field or array that contains the sensitive data. If Macie detects sensitive data in the name of any element in the path, Macie omits the jsonPath field from the Record object. |
recordIndex | Integer | For an Avro object container or a Parquet file, the record index, starting from 0, for the record that contains the sensitive data. For a JSON Lines file, the line index, starting from 0, for the line that contains the sensitive data. This value is always 0 for JSON files. |
The following example shows the structure of a Record object that reports an occurrence of sensitive data in a Parquet file. In this example, Macie truncated the name of the field that contains the data, specified in the jsonPath field, to meet the character limit.

    "records": [
      {
        "jsonPath": "$['…hijklmnopqrstuvwxyz']",
        "recordIndex": 7663
      }
    ]

In the preceding example, the finding indicates that the record of index 7663 (record number 7664) contains sensitive data. In that record, the sensitive data is in the field whose name ends with hijklmnopqrstuvwxyz. The full JSON path to the field in the record is $.abcdefghijklmnopqrstuvwxyz.
The following example also shows the structure of a Record object that reports an occurrence of sensitive data in a Parquet file. In this example, Macie truncated both the full path and the name of the field that contains the data.

    "records": [
      {
        "jsonPath": "$..usssn2.usssn3.usssn4.usssn5.usssn6.usssnfield7.usssn8.usssn9.usssn10.usssn11.usssn12.usssn13.usssn14.usssn15.usssn16.usssn17.usssn18.usssn19.usssn20.usssn21.usssn22.usssn23.usssn24.usssn25.usssn26.usssn27.usssn28.usssn29['…hijklmnopqrstuvwxyz']",
        "recordIndex": 2335
      }
    ]

In the preceding example, the finding indicates that the record of index 2335 (record number 2336) contains sensitive data. In that record, the sensitive data is in the field whose name ends with hijklmnopqrstuvwxyz. The full JSON path to the field in the record is:

    $['1234567890']usssn1.usssn2.usssn3.usssn4.usssn5.usssn6.usssnfield7.usssn8.usssn9.usssn10.usssn11.usssn12.usssn13.usssn14.usssn15.usssn16.usssn17.usssn18.usssn19.usssn20.usssn21.usssn22.usssn23.usssn24.usssn25.usssn26.usssn27.usssn28.usssn29['abcdefghijklmnopqrstuvwxyz']

The following example shows the structure of a Record object that reports an occurrence of sensitive data in a JSON file. In this example, the sensitive data is a specific value in an array.

    "records": [
      {
        "jsonPath": "$.access.key[2]",
        "recordIndex": 0
      }
    ]

In the preceding example, the finding indicates that the second value in an array named key contains sensitive data. The array is a child of an object named access.
The following example shows the structure of a Record object that reports an occurrence of sensitive data in a JSON Lines file.

    "records": [
      {
        "jsonPath": "$.access.key",
        "recordIndex": 3
      }
    ]

In the preceding example, the finding indicates that the line of index 3 in the file (the fourth line, because the line index starts from 0) contains sensitive data. In that line, the sensitive data is in a field named key, which is a child of an object named access.
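To follow a Record object back to the data in a JSON Lines file, you can select the line by recordIndex and then walk the path. The following sketch is deliberately limited: it assumes a zero-based recordIndex, as the table describes, and it handles only plain dot-notation paths such as $.access.key. It rejects truncated, bracketed, or array-indexed paths rather than guessing at them. The function name is hypothetical.

```python
import json
import re
from typing import Any, Dict

# Accept only simple paths like $.access.key (no brackets, no truncation).
_SIMPLE_PATH = re.compile(r"^\$(\.[A-Za-z_][A-Za-z0-9_]*)+$")


def resolve_record(jsonl_text: str, record: Dict[str, Any]) -> Any:
    """Return the value that a Record object points to in JSON Lines text."""
    path = record.get("jsonPath")
    if path is None or not _SIMPLE_PATH.match(path):
        raise ValueError(f"unsupported or missing jsonPath: {path!r}")
    lines = jsonl_text.splitlines()
    value = json.loads(lines[record["recordIndex"]])  # zero-based line index
    for key in path[2:].split("."):  # strip the leading "$."
        value = value[key]
    return value
```

For anything beyond simple paths, a full JSONPath library is a better fit than this helper.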