Redacting PII in your batch job
When redacting personally identifiable information (PII) from a transcript during a batch
transcription job, Amazon Transcribe replaces each identified instance of PII with
[PII]
in the main text body of your transcript. You can also view the type of PII
that is redacted in the word-for-word portion of the transcription output. For an output sample,
see Example redacted output (batch).
Redaction with batch transcriptions is available with US English (en-US
) and US Spanish (es-US
).
Redaction is not compatible with language identification.
Both redacted and unredacted transcripts are stored in the same output Amazon S3 bucket. Amazon Transcribe stores them in a bucket you specify or in the default Amazon S3 bucket managed by the service.
Types of PII Amazon Transcribe can recognize for batch
transcriptions | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PII type | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ADDRESS |
A physical address, such as 100 Main Street, Anytown, USA or Suite #12, Building 123. An address can include a street, building, location, city, state, country, county, zip, precinct, neighborhood, and more. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ALL |
Redact or identify all PII types listed in this table. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BANK_ACCOUNT_NUMBER |
A US bank account number. These are typically between 10 - 12 digits long, but Amazon Transcribe also recognizes bank account numbers when only the last 4 digits are present. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BANK_ROUTING |
A US bank account routing number. These are typically 9 digits long, but Amazon Transcribe also recognizes routing numbers when only the last 4 digits are present. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CREDIT_DEBIT_CVV |
A 3-digit card verification code (CVV) that is present on VISA, MasterCard, and Discover credit and debit cards. In American Express credit or debit cards, it is a 4-digit numeric code. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CREDIT_DEBIT_EXPIRY |
The expiration date for a credit or debit card. This number is usually 4 digits long and formatted as month/year or MM/YY. For example, Amazon Transcribe can recognize expiration dates such as 01/21, 01/2021, and Jan 2021. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CREDIT_DEBIT_NUMBER |
The number for a credit or debit card. These numbers can vary from 13 to 16 digits in length, but Amazon Transcribe also recognizes credit or debit card numbers when only the last 4 digits are present. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
EMAIL |
An email address, such as efua.owusu@email.com. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
NAME |
An individual's name. This entity type does not include titles, such as Mr., Mrs., Miss, or Dr. Amazon Transcribe does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Transcribe recognizes the John Doe Organization as an organization, and Jane Doe Street as an address. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PHONE |
A phone number. This entity type also includes fax and pager numbers. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PIN |
A 4-digit personal identification number (PIN) that allows someone to access their bank account information. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SSN |
A Social Security Number (SSN) is a 9-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Transcribe also recognizes Social Security Numbers when only the last 4 digits are present. |
You can start a batch transcription job using the AWS Management Console, AWS CLI, or AWS SDK.
-
Sign in to the AWS Management Console
. -
In the navigation pane, choose Transcription jobs, then select Create job (top right). This will open the Specify job details page.
-
After filling in your desired fields on the Specify job details page, select Next to go to the Configure job - optional page. Here you'll find the Content removal panel with the PII redaction toggle.
-
Once you select PII redaction, you have the option to select all PII types you want to redact. You can also choose to have an unredacted transcript if you select Include unredacted transcript in job output box.
-
Select Create job to run your transcription job.
This example uses the start-transcription-jobcontent-redaction
parameter. For more information, see StartTranscriptionJob
and
ContentRedaction
.
aws transcribe start-transcription-job \ --region
us-west-2
\ --transcription-job-namemy-first-transcription-job
\ --media MediaFileUri=s3://DOC-EXAMPLE-BUCKET
/my-input-files
/my-media-file
.flac
\ --output-bucket-nameDOC-EXAMPLE-BUCKET
\ --output-keymy-output-files
/ \ --language-codeen-US
\ --content-redaction RedactionType=PII
,RedactionOutput=redacted
,PiiEntityTypes=NAME
,ADDRESS
,BANK_ACCOUNT_NUMBER
Here's another example using the start-transcription-job
aws transcribe start-transcription-job \ --region
us-west-2
\ --cli-input-json file://filepath
/my-first-redaction-job
.json
The file my-first-redaction-job.json contains the following request body.
{ "TranscriptionJobName": "
my-first-transcription-job
", "Media": { "MediaFileUri": "s3://DOC-EXAMPLE-BUCKET
/my-input-files
/my-media-file
.flac
" }, "OutputBucketName": "DOC-EXAMPLE-BUCKET
", "OutputKey": "my-output-files
/", "LanguageCode": "en-US
", "ContentRedaction": { "RedactionOutput":"redacted
", "RedactionType":"PII", "PiiEntityTypes": [ "NAME
", "ADDRESS
", "BANK_ACCOUNT_NUMBER
" ] } }
This example uses the AWS SDK for Python (Boto3) to redact content using the ContentRedaction
argument for the
start_transcription_jobStartTranscriptionJob
and
ContentRedaction
.
For additional examples using the AWS SDKs, including feature-specific, scenario, and cross-service examples, refer to the Code examples for Amazon Transcribe using AWS SDKs chapter.
from __future__ import print_function import time import boto3 transcribe = boto3.client('transcribe', '
us-west-2
') job_name = "my-first-transcription-job
" job_uri = "s3://DOC-EXAMPLE-BUCKET
/my-input-files
/my-media-file
.flac
" transcribe.start_transcription_job( TranscriptionJobName = job_name, Media = { 'MediaFileUri': job_uri }, OutputBucketName = 'DOC-EXAMPLE-BUCKET
', OutputKey = 'my-output-files
/', LanguageCode = 'en-US
', ContentRedaction = { 'RedactionOutput':'redacted
', 'RedactionType':'PII', 'PiiEntityTypes': [ 'NAME
','ADDRESS
','BANK_ACCOUNT_NUMBER
' ] } ) while True: status = transcribe.get_transcription_job(TranscriptionJobName = job_name) if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']: break print("Not ready yet...") time.sleep(5) print(status)
Note
PII redaction for batch jobs is only supported in these AWS Regions: Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), GovCloud (US-West), Canada (Central), EU (Frankfurt), EU (Ireland), EU (London), EU (Paris), Middle East (Bahrain), South America (Sao Paulo), US East (N. Virginia), US East (Ohio), US West (Oregon), and US West (N. California).