
Setting up AWS Entity Resolution

Before you use AWS Entity Resolution for the first time, complete the following tasks.

Sign up for AWS

If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account
  1. Open https://portal.aws.amazon.com/billing/signup.

  2. Follow the online instructions.

    Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.

    When you sign up for an AWS account, an AWS account root user is created. The root user has access to all AWS services and resources in the account. As a security best practice, assign administrative access to an administrative user, and use only the root user to perform tasks that require root user access.

Create an administrator user

To create an administrator user, choose one of the following options.

  • In IAM Identity Center (Recommended) – Use short-term credentials to access AWS. This aligns with security best practices; for more information, see Security best practices in IAM in the IAM User Guide. To get set up, follow the instructions in Getting started in the AWS IAM Identity Center User Guide. You can also configure programmatic access by following Configuring the AWS CLI to use AWS IAM Identity Center in the AWS Command Line Interface User Guide.

  • In IAM (Not recommended) – Use long-term credentials to access AWS. To get set up, follow the instructions in Creating your first IAM admin user and user group in the IAM User Guide. You can also configure programmatic access by following Managing access keys for IAM users in the IAM User Guide.

Subscribe to a provider service on AWS Data Exchange

Complete the following procedure if you are using a provider service-based matching workflow or an ID mapping workflow. Otherwise, you can skip this step.

In AWS Entity Resolution, you can run a matching workflow with a provider service (such as LiveRamp, TransUnion, or Unified ID 2.0) if you have a subscription with that provider on AWS Data Exchange. Your data is matched with a set of inputs defined by your preferred provider.

In addition, you can run an ID mapping workflow with LiveRamp if you have a subscription with that provider.

There are two ways to subscribe to a provider service:

  • Private offer – If you have an existing relationship with a provider, follow the Private products and offers procedure in the AWS Data Exchange User Guide to accept a private offer on AWS Data Exchange.

  • Bring your own subscription – If you already have an existing data subscription with a provider, follow the Bring Your Own Subscription (BYOS) offers procedure in the AWS Data Exchange User Guide to accept a BYOS offer on AWS Data Exchange.

After you have subscribed to a provider service on AWS Data Exchange, you can then create a matching workflow or an ID mapping workflow with that provider service.

For more information about how to access a provider product that contains APIs, see Accessing an API product in the AWS Data Exchange User Guide.

Prepare data tables

In AWS Entity Resolution, each of your input data tables contains source records. These records contain consumer identifiers such as first name, last name, email address, or phone number. These source records can be matched with other source records that you provide, within the same or other input data tables. Each record must have a unique Record ID (Unique ID), and you must define it as the primary key when you create a schema mapping in AWS Entity Resolution.
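For illustration, the following minimal Python sketch adds such a Unique ID column before you upload the data; it assumes pandas is installed and uses hypothetical names (customers.csv, record_id) that you would replace with your own.

  import uuid

  import pandas as pd

  # Hypothetical input file; adjust the file name and columns to your data.
  df = pd.read_csv("customers.csv")

  # Give every source record a unique, non-null Record ID (Unique ID).
  df["record_id"] = [uuid.uuid4().hex for _ in range(len(df))]

  df.to_csv("customers_with_ids.csv", index=False)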

Every input data table is available as an AWS Glue table backed by Amazon S3. You can use your first-party data already within Amazon S3, or import data tables from other SaaS providers into Amazon S3. After the data is uploaded to Amazon S3, you can use an AWS Glue crawler to create a data table in the AWS Glue Data Catalog. You can then use the data table as an input to AWS Entity Resolution.

Preparing your data tables involves the following steps:

Step 1: Prepare your input data

Complete the following procedure if you are using a matching workflow with a provider service. Otherwise, you can skip this step.

For more information, see Subscribe to a provider service on AWS Data Exchange.

If you want to run a provider service-based matching workflow or an ID mapping workflow, consult the following requirements to prepare your input data:

LiveRamp (Unique ID needed: Yes)

Ensure the following:

  • The Unique ID can be either your own pseudonymous identifier or a row ID.

  • Your data input file format and normalization is aligned with the LiveRamp guidelines.

    For more information about input file formatting guidelines for the matching workflow, see Perform Identity Resolution Through ADX in the LiveRamp documentation.

    For more information about input file formatting guidelines for the ID mapping workflow, see Perform Transcoding Through ADX in the LiveRamp documentation.

TransUnion (Unique ID needed: Yes)

Ensure the following (a Python sketch that applies these formatting rules appears after the provider requirements):

  • A Unique ID exists for TransUnion Data Enrichment.

    Note

    Pass-along attributes are allowed to persist in the input and output to TransUnion. Household E keys and HHID are specific to the client namespace.

  • Phone number should be 10 digits, without any special characters such as spaces or hyphens.

  • Addresses should be split into

    • a single address line (combine address lines 1 & 2, if present)

    • city

    • zip (or zip plus 4), without any special characters such as spaces or hyphens

    • state, specified as a 2-letter code

  • Email addresses should be in plaintext.

  • First Name can be lowercase or uppercase; nicknames are supported, but titles and suffixes should be excluded.

  • Last Name can be lowercase or uppercase; middle initials should be excluded.

Unified ID 2.0 (Unique ID needed: Yes)

Ensure the following:

  • The Unique ID cannot be a hash.

  • UID2 supports both email and phone number for UID2 generation. However, if both values are present in the schema mapping, the workflow duplicates each record in the output. One record uses the email for UID2 generation and the second record uses phone number. If your data includes a mix of emails and phone numbers and you don't want this duplication of records in the output, the best approach is to create a separate workflow for each, with separate schema mappings. In this scenario, go through the steps twice—create one workflow for emails and a separate one for phone numbers.

Note

A specific email or phone number, at any specific time, results in the same raw UID2 value, no matter who made the request.

Raw UID2s are created by adding salts from salt buckets which are rotated approximately once a year, causing the raw UID2 to also be rotated with it. Different salt buckets rotate at different times throughout the year. AWS Entity Resolution currently does not keep track of rotating salt buckets and raw UID2s, so it is recommended that you regenerate the raw UID2s daily. For more information, see How often should UID2s be refreshed for incremental updates? in the UID 2.0 documentation.
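If you are preparing input for TransUnion, the following Python sketch (referenced above) applies the formatting rules from the provider requirements; it assumes pandas is installed and uses hypothetical column names (phone, address_line1, address_line2, zip, state, first_name), so adjust it to your own schema.

  import re

  import pandas as pd

  # Hypothetical input file produced in the earlier sketch.
  df = pd.read_csv("customers_with_ids.csv")

  # Phone: keep digits only (10 digits, no spaces or hyphens).
  df["phone"] = df["phone"].astype(str).str.replace(r"\D", "", regex=True)

  # Address: combine address lines 1 and 2 into a single address line.
  df["address"] = (
      df["address_line1"].fillna("").astype(str)
      + " "
      + df["address_line2"].fillna("").astype(str)
  ).str.strip()

  # ZIP: strip special characters such as spaces or hyphens.
  df["zip"] = df["zip"].astype(str).str.replace(r"[^0-9]", "", regex=True)

  # State: 2-letter code, uppercase.
  df["state"] = df["state"].astype(str).str.strip().str.upper().str[:2]

  # First Name: strip common leading titles and trailing suffixes.
  df["first_name"] = (
      df["first_name"].astype(str)
      .str.replace(r"^(mr|mrs|ms|dr)\.?\s+", "", flags=re.IGNORECASE, regex=True)
      .str.replace(r"\s+(jr|sr|ii|iii|iv)\.?$", "", flags=re.IGNORECASE, regex=True)
  )

  df.to_csv("customers_normalized.csv", index=False)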

Step 2: Save your input data table in a supported data format

If you already saved your input data in a supported data format, you can skip this step.

To use AWS Entity Resolution, your input data must be in a supported format. AWS Entity Resolution supports the following data formats:

  • comma-separated value (CSV)

    Note

    LiveRamp only supports CSV files.

  • Parquet
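If your data is currently in another format, one common way to produce a supported format is to convert it with pandas, as in this minimal sketch; it assumes the pyarrow (or fastparquet) package is installed and uses a hypothetical file name.

  import pandas as pd

  # Read the prepared CSV and write it back out as Parquet.
  df = pd.read_csv("customers_normalized.csv")
  df.to_parquet("customers_normalized.parquet", index=False)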

Step 3: Upload your input data table to Amazon S3

If you already have your first-party data table in Amazon S3, you can skip this step.

Note

The input data must be stored in Amazon Simple Storage Service (Amazon S3) in the same AWS account and AWS Region in which you want to run the matching workflow.

To upload your input data table to Amazon S3
  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. Choose Buckets, and then choose a bucket to store your data table.

  3. Choose Upload, and then follow the prompts.

  4. Choose the Objects tab to view the prefix where your data is stored. Make a note of the name of the folder.

    You can select the folder to view the data table.
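If you prefer to script the upload instead of using the console, a minimal boto3 sketch follows; the bucket name and key are hypothetical, and the bucket must be in the same AWS account and AWS Region where you plan to run the matching workflow.

  import boto3

  s3 = boto3.client("s3")

  # Hypothetical bucket and prefix; replace with your own values.
  bucket = "my-entity-resolution-input"
  key = "input/customers_normalized.parquet"

  # Upload the local file to s3://<bucket>/<key>.
  s3.upload_file("customers_normalized.parquet", bucket, key)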

Step 4: Create an AWS Glue table

The input data in Amazon S3 must be cataloged in AWS Glue and represented as an AWS Glue table. For more information about how to create an AWS Glue table with Amazon S3 as the input, see Working with crawlers on the AWS Glue console in the AWS Glue Developer Guide.

Note

AWS Entity Resolution doesn't support partitioned tables.

In this step, you set up a crawler in AWS Glue that crawls all the files in your S3 bucket and creates an AWS Glue table.

Note

AWS Entity Resolution doesn't currently support Amazon S3 locations registered with AWS Lake Formation.

To create an AWS Glue table
  1. Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.

  2. From the navigation bar, select Crawlers.

  3. Select your S3 bucket from the list, and then choose Add crawler.

  4. On the Add crawler page, enter a Crawler name and then choose Next.

  5. Continue through the Add crawler page, specifying the details.

  6. On the Choose an IAM role page, choose Choose an existing IAM role and then choose Next.

    You can also choose Create an IAM role or have your administrator create the IAM role if needed.

  7. For Create a schedule for this crawler, keep the Frequency default (Run on demand) and then choose Next.

  8. For Configure the crawler’s output, enter the AWS Glue database and then choose Next.

  9. Review all of the details, and then choose Finish.

  10. On the Crawlers page, select the check box next to your S3 bucket and then choose Run crawler.

  11. After the crawler is finished running, on the AWS Glue navigation bar, choose Databases, and then choose your database name.

  12. On the Database page, choose Tables in {your database name}.

    1. View the tables in the AWS Glue database.

    2. To view a table's schema, select a specific table.

  13. Make a note of the AWS Glue database name and AWS Glue table name.
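The crawler setup can also be scripted. The following boto3 sketch creates a database and an on-demand crawler and then runs it; the database name, S3 path, and crawler role ARN are hypothetical, and the role must already exist with permission to read the S3 path.

  import boto3

  glue = boto3.client("glue")

  # Hypothetical names; replace with your own values.
  database_name = "entity_resolution_db"
  crawler_name = "entity-resolution-input-crawler"
  s3_path = "s3://my-entity-resolution-input/input/"
  crawler_role_arn = "arn:aws:iam::111122223333:role/MyGlueCrawlerRole"

  # Create the Glue database that will hold the cataloged table.
  glue.create_database(DatabaseInput={"Name": database_name})

  # Create an on-demand crawler that catalogs the files under the S3 path.
  # Remember that AWS Entity Resolution doesn't support partitioned tables.
  glue.create_crawler(
      Name=crawler_name,
      Role=crawler_role_arn,
      DatabaseName=database_name,
      Targets={"S3Targets": [{"Path": s3_path}]},
  )

  # Run the crawler; the resulting table appears in the Glue Data Catalog.
  glue.start_crawler(Name=crawler_name)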

Create an IAM role for a console user

To create an IAM role
  1. Sign in to the IAM console (https://console.aws.amazon.com/iam/) with your administrator account.

  2. Under Access management, choose Roles.

    You can use Roles to create short-term credentials, which is recommended for increased security. You can also choose Users to create long-term credentials.

  3. Choose Create role.

  4. In the Create role wizard, for Trusted entity type, choose AWS account.

  5. Keep the option This account selected, and then choose Next.

  6. For Add permissions, choose Create Policy.

    A new tab opens.

    1. Select the JSON tab, and then add policies depending on the abilities granted to the console user. AWS Entity Resolution offers managed policies based on common use cases, such as AWSEntityResolutionConsoleFullAccess.

    2. Choose Next: Tags, add tags (optional), and then choose Next: Review.

    3. For Review policy, enter a Name and Description, and review the Summary.

    4. Choose Create policy.

      You have created a policy for the console user.

    5. Go back to your original tab and under Add permissions, enter the name of the policy that you just created. (You might need to reload the page.)

    6. Select the check box next to the name of the policy that you created, and then choose Next.

  7. For Name, review, and create, enter the Role name and Description.

    1. Review Select trusted entities, and enter the AWS account for the person or persons who will assume the role (if necessary).

    2. Review the permissions in Add permissions, and edit if necessary.

    3. Review the Tags, and add tags if necessary.

    4. Choose Create role.
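As a rough programmatic sketch of the same setup (not a substitute for the console procedure), you could create the console user role with boto3 as follows; the role name and trusted account ID are hypothetical, and the example attaches the AWSEntityResolutionConsoleFullAccess managed policy mentioned later in this guide, assuming the standard AWS managed policy ARN format.

  import json

  import boto3

  iam = boto3.client("iam")

  # Hypothetical trusted account; principals in this account may assume the role.
  trusted_account_id = "111122223333"
  trust_policy = {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
              "Action": "sts:AssumeRole",
          }
      ],
  }

  iam.create_role(
      RoleName="EntityResolutionConsoleUser",
      AssumeRolePolicyDocument=json.dumps(trust_policy),
      Description="Console user role for AWS Entity Resolution",
  )

  # Attach a managed policy that matches the abilities granted to the user.
  iam.attach_role_policy(
      RoleName="EntityResolutionConsoleUser",
      PolicyArn="arn:aws:iam::aws:policy/AWSEntityResolutionConsoleFullAccess",
  )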

Create a workflow job role for AWS Entity Resolution

AWS Entity Resolution uses a workflow job role to run a workflow. You can create this role using the console if you have the necessary IAM permissions. If you don't have CreateRole permissions, ask your administrator to create the role.

To create a workflow job role for AWS Entity Resolution
  1. Sign in to the IAM console at https://console.aws.amazon.com/iam/ with your administrator account.

  2. Under Access management, choose Roles.

    You can use Roles to create short-term credentials, which is recommended for increased security. You can also choose Users to create long-term credentials.

  3. Choose Create role.

  4. In the Create role wizard, for Trusted entity type, choose Custom trust policy.

  5. Copy and paste the following custom trust policy into the JSON editor.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "entityresolution.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] }
  6. Choose Next.

  7. For Add permissions, choose Create Policy.

    A new tab appears.

    1. Copy and paste the following policy into the JSON editor.

      Note

      The following example policy supports the permissions needed to read corresponding data resources like Amazon S3 and AWS Glue. However, you might need to modify this policy depending on how you've set up your data sources.

      Your AWS Glue resources and underlying Amazon S3 resources must be in the same AWS Region as AWS Entity Resolution.

      You don't need to grant AWS KMS permissions if your data sources aren't encrypted or decrypted.

      { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::{{input-buckets}}", "arn:aws:s3:::{{input-buckets}}/*" ], "Condition":{ "StringEquals":{ "s3:ResourceAccount":[ "{{accountId}}" ] } } }, { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::{{output-bucket}}", "arn:aws:s3:::{{output-bucket}}/*" ], "Condition":{ "StringEquals":{ "s3:ResourceAccount":[ "{{accountId}}" ] } } }, { "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:GetTable", "glue:GetPartition", "glue:GetPartitions", "glue:GetSchema", "glue:GetSchemaVersion", "glue:BatchGetPartition" ], "Resource": [ "arn:aws:glue:{{aws-region}}:{{accountId}}:database/{{input-databases}}", "arn:aws:glue:{{aws-region}}:{{accountId}}:table/{{input-database}}/{{input-tables}}", "arn:aws:glue:{{aws-region}}:{{accountId}}:catalog" ] } ] }
    2. (Optional) If the input Amazon S3 bucket is encrypted using the customer’s KMS key, add the following:

      { "Effect": "Allow", "Action": [ "kms:Decrypt" ], "Resource": [ "arn:aws:kms:{{aws-region}}:{{accountId}}:key/{{inputKeys}}" ] }
    3. (Optional) If the data being written into the output Amazon S3 bucket needs to be encrypted, add the following:

      { "Effect": "Allow", "Action": [ "kms:GenerateDataKey", "kms:Encrypt" ], "Resource": [ "arn:aws:kms:{{region}}:{{accountId}}:key/{{outputKeys}}" ] }
    4. Replace each {{user input placeholder}} with your own information.

      • region – The AWS Region of your resources. Your AWS Glue resources, underlying Amazon S3 resources, and AWS KMS resources must be in the same AWS Region as AWS Entity Resolution.

      • accountId – Your AWS account ID.

      • input-buckets – The Amazon S3 buckets that contain the underlying data objects of AWS Glue that AWS Entity Resolution reads from.

      • output-buckets – The Amazon S3 buckets where AWS Entity Resolution generates the output data.

      • input-databases – The AWS Glue databases that AWS Entity Resolution reads from.

      • input-tables – The AWS Glue tables that AWS Entity Resolution reads from.

      • inputKeys – Managed keys in AWS Key Management Service. If your input sources are encrypted, AWS Entity Resolution must decrypt your data using your key.

      • outputKeys – Managed keys in AWS Key Management Service. If you need your output sources to be encrypted, AWS Entity Resolution must encrypt the output data using your key.
  8. Go back to your original tab and under Add permissions, enter the name of the policy that you just created. (You might need to reload the page.)

  9. Select the check box next to the name of the policy that you created, and then choose Next.

  10. For Name, review, and create, enter the Role name and Description.

    Note

    The Role name must match the pattern in the passRole permissions granted to the member who can pass the workflow job role to create a matching workflow.

    For example, if you're using the AWSEntityResolutionConsoleFullAccess managed policy, remember to include entityresolution in your role name.

    1. Review Select trusted entities, and edit if necessary.

    2. Review the permissions in Add permissions, and edit if necessary.

    3. Review the Tags, and add tags if necessary.

    4. Choose Create role.

  11. You have now created the workflow job role for AWS Entity Resolution.
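If you script this role instead of using the console, the following boto3 sketch creates it with the trust policy from step 5 and adds the permissions policy from step 7 as an inline policy (the console procedure above creates a customer managed policy instead); the role and policy names are examples, and workflow-job-policy.json is a hypothetical file containing the policy JSON with the placeholders already replaced.

  import json

  import boto3

  iam = boto3.client("iam")

  # Trust policy from step 5: allow AWS Entity Resolution to assume the role.
  trust_policy = {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {"Service": ["entityresolution.amazonaws.com"]},
              "Action": "sts:AssumeRole",
          }
      ],
  }

  # Include "entityresolution" in the role name so it matches the passRole
  # pattern in the AWSEntityResolutionConsoleFullAccess managed policy.
  role_name = "entityresolution-workflow-job-role"

  iam.create_role(
      RoleName=role_name,
      AssumeRolePolicyDocument=json.dumps(trust_policy),
      Description="Workflow job role for AWS Entity Resolution",
  )

  # Permissions policy from step 7, with the {{placeholders}} already replaced.
  with open("workflow-job-policy.json") as f:
      permissions_policy = f.read()

  iam.put_role_policy(
      RoleName=role_name,
      PolicyName="entityresolution-workflow-job-permissions",
      PolicyDocument=permissions_policy,
  )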