Step 12: Query the Data in the Data Lake Using Amazon Redshift Spectrum - AWS Lake Formation

Step 12: Query the Data in the Data Lake Using Amazon Redshift Spectrum

You can set up Amazon Redshift Spectrum to query the data that you imported into your Amazon Simple Storage Service (Amazon S3) data lake. First, create an AWS Identity and Access Management (IAM) role that is used to launch the Amazon Redshift cluster and to query the Amazon S3 data. Then, grant this role the Select permissions on the tables that you want to query. Then, grant the user permissions to use the Amazon Redshift query editor. Finally, create an Amazon Redshift cluster and run queries.

You create the cluster as an administrator, and query the cluster as a data analyst.

For more information about Amazon Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data in the Amazon Redshift Database Developer Guide.

To set up permissions to run Amazon Redshift queries

  1. Open the IAM console at https://console.aws.amazon.com/iam/. Sign in as the IAM administrator user that you created in Create an Administrator IAM User (user name Administrator) or as an IAM user with the AdministratorAccess AWS managed policy.

  2. In the navigation pane, choose Policies.

    If this is your first time choosing Policies, the Welcome to Managed Policies page appears. Choose Get Started.

  3. Choose Create policy.

  4. Choose the JSON tab.

  5. Paste in the following JSON policy document.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "lakeformation:GetDataAccess", "glue:GetTable", "glue:GetTables", "glue:SearchTables", "glue:GetDatabase", "glue:GetDatabases", "glue:GetPartitions" ], "Resource": "*" } ] }
  6. When you are finished, choose Review to review the policy. The policy validator reports any syntax errors.

  7. On the Review policy page, enter the Name as RedshiftLakeFormationPolicy for the policy that you are creating. Enter a Description (optional). Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.

  8. In the navigation pane of the IAM console, choose Roles, and then choose Create role.

  9. For Select type of trusted entity, choose AWS service.

  10. Choose the Amazon Redshift service to assume this role.

  11. Choose the Redshift Customizable use case for your service. Then choose Next: Permissions.

  12. Search for the permissions policy that you created, RedshiftLakeFormationPolicy, and select the check box next to the policy name in the list.

  13. Choose Next: Tags.

  14. Choose Next: Review.

  15. For Role name, enter the name RedshiftLakeFormationRole.

  16. (Optional) For Role description, enter a description for the new role.

  17. Review the role, and then choose Create role.

To grant Select permissions on the table to be queried in the Lake Formation database

  1. Open the Lake Formation console at https://console.aws.amazon.com/lakeformation/. Sign in as the data lake administrator.

  2. In the navigation pane, under Permissions, choose Data permissions, and then choose Grant.

  3. Provide the following information:

    • For IAM users and roles, choose the IAM role you created, RedshiftLakeFormationRole. When you run the Amazon Redshift Query Editor, it uses this IAM role for permission to the data.

    • For Database, choose lakeformation_tutorial.

      The tables list populates.

    • For Table, choose a table within the data source to query.

    • Choose the Select table permission.

  4. Choose Grant.

To set up Amazon Redshift Spectrum and run queries (new Amazon Redshift console)

Note

The following instructions are for the new Amazon Redshift console.

  1. Open the Amazon Redshift console at https://console.aws.amazon.com/redshift. Sign in as the user Administrator.

  2. Choose Create cluster.

  3. On the Create cluster page, under DC2, select dc2.large.

  4. Scroll down, and under Cluster details, enter or accept these parameters:

    Cluster identifier redshift-lakeformation-demo
    Database port 5439
    Master user name awsuser
    Master user password (Choose a password)
  5. Expand Cluster permissions, and for Available IAM roles, choose RedshiftLakeFormationRole. Then choose Add IAM role.

  6. Choose Create cluster.

    The Clusters page loads.

  7. Wait until the cluster status becomes Available. Choose the refresh icon periodically.

  8. Grant the data analyst permission to run queries against the cluster. To do so, complete the following steps.

    1. Open the IAM console at https://console.aws.amazon.com/iam/, and sign in as the Administrator user.

    2. In the navigation pane, choose Users, and attach the following managed policies to the user datalake_user.

      • AmazonRedshiftQueryEditor

      • AmazonRedshiftReadOnlyAccess

  9. Sign out of the Amazon Redshift console and sign back in as user datalake_user.

  10. In the left vertical toolbar, choose the EDITOR icon to open the query editor and connect to the cluster. If the Connect to database dialog box appears, choose the cluster name redshift-lakeformation-demo, and enter the database name dev, the user name awsuser, and the password that you created. Then choose Connect to database.

    Note

    If you are not prompted for connection parameters and another cluster is already selected in the query editor, choose Change Connection to open the Connect to database dialog box.

  11. In the New Query 1 text box, enter and run the following statement to map the database lakeformation_tutorial in Lake Formation to the Amazon Redshift schema name redshift_jdbc:

    Important

    Replace <account-id> with a valid AWS account number, and <region> with a valid AWS Region name (for example, us-east-1).

    create external schema if not exists redshift_jdbc from DATA CATALOG database 'lakeformation_tutorial' iam_role 'arn:aws:iam::<account-id>:role/RedshiftLakeFormationRole' region '<region>';
  12. In the schema list under Select schema, choose redshift_jdbc.

    The tables list populates. The query editor shows only the tables on which you were granted Lake Formation data permissions.

  13. On the pop-up menu next to a table name, choose Preview data.

    Amazon Redshift returns the first 10 rows.

    You can now run queries against the tables and columns for which you have permissions.

To set up Amazon Redshift Spectrum and run queries (original Amazon Redshift console)

Note

The following instructions are for the original Amazon Redshift console.

  1. Open the Amazon Redshift console at https://console.aws.amazon.com/redshift/. Sign in as the user Administrator.

  2. Choose Quick launch cluster.

  3. On the Quick launch page, enter or accept these parameters:

    Cluster identifier redshift-lakeformation-demo
    Database name dev
    Database port 5439
    Master user name awsuser
    Master user password (Choose and confirm a password)
    Available IAM roles RedshiftLakeFormationRole

    For more information, see Managing Clusters Using the Console in the Amazon Redshift Cluster Management Guide.

  4. Choose Launch cluster.

  5. Wait until the cluster status becomes available.

    To view cluster status, in the navigation pane, choose Clusters and check the Cluster Status column. Choose the refresh icon periodically.

  6. Grant the data analyst permission to run queries against the cluster. To do so, complete the following steps.

    1. Open the IAM console at https://console.aws.amazon.com/iam/, and sign in as the Administrator user.

    2. In the navigation pane, choose Users, and attach the following managed policies to the user datalake_user.

      • AmazonRedshiftReadOnlyAccess

      • AmazonRedshiftQueryEditor

  7. Sign out of Amazon Redshift console and sign back in as user datalake_user.

  8. Open the query editor and connect to the cluster. If prompted for credentials, choose the cluster name redshift-lakeformation-demo, and enter the database name dev, the user name awsuser, and the password that you created. Then choose Connect.

    Note

    If you are not prompted for credentials and another cluster is already selected in the query editor, choose the cluster name to open the credentials dialog box.

  9. In the New Query 1 text box, enter and run the following statement to map the database lakeformation_tutorial in Lake Formation to the Amazon Redshift schema name redshift_jdbc:

    Important

    Replace <account-id> with a valid AWS account number, and <region> with a valid AWS Region name (for example, us-east-1).

    create external schema if not exists redshift_jdbc from DATA CATALOG database 'lakeformation_tutorial' iam_role 'arn:aws:iam::<account-id>:role/RedshiftLakeFormationRole' region '<region>';
  10. In the Schema list, choose redshift_jdbc.

    The tables list populates. The query editor shows only the tables on which you were granted Lake Formation data permissions.

  11. Choose the eye icon next to one of the tables.

    Amazon Redshift returns the first 10 rows.

    You can now run queries against the tables and columns for which you have permissions.