AWS Lake Formation
Developer Guide


Importing Data Using Workflows

With AWS Lake Formation, you can import your data using workflows. A workflow is a processing construct that defines the data source and schedule to import data into your data lake. It is a container for AWS Glue crawlers, jobs, and triggers that are used to orchestrate the processes to load and update the data lake. When you create a workflow, you choose a blueprint, or template, for the type of workflow to create.

For more information about workflows, see Workflows.

To create a workflow from a blueprint

  1. Open the AWS Lake Formation console at https://console.aws.amazon.com/lakeformation/. Sign in as the data lake administrator or as a user who has data engineer permissions. For more information, see Lake Formation Personas and Permissions Reference.

  2. In the navigation pane, choose Blueprints, and then choose Use blueprint.

  3. On the Use a blueprint page, start by choosing a blueprint type.

    Lake Formation provides the following types of blueprints:

    • Database snapshot – Loads or reloads all data into the data lake from a JDBC source. You can exclude some data from the source based on an exclude pattern.

    • Incremental database – Loads only new data into the data lake from a JDBC source. You specify the individual tables in the JDBC source database to include. For each table, you choose the bookmark columns and bookmark sort order to keep track of data that has previously been loaded.

    • Log file – Bulk loads data from log file sources, including AWS CloudTrail, Elastic Load Balancing logs, and Application Load Balancer logs.

  4. Under Import source, specify the data source.

    If you are importing from a JDBC source, specify a database connection. You can import data from MySQL, PostgreSQL, Oracle, and Microsoft SQL Server databases.

    Ensure that the role that you specify for the workflow (the "workflow role") has the required IAM permissions to access the data source. Note that two principals are involved: the user who creates the workflow and the workflow role itself. For example, to import AWS CloudTrail logs, the user must have the cloudtrail:DescribeTrails and cloudtrail:LookupEvents permissions in order to see the list of CloudTrail logs while creating the workflow, and the workflow role must have read permissions on the CloudTrail log location in Amazon S3.
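    The two sets of permissions above could be expressed with policy statements along the following lines. This is a sketch, not the exact policies AWS publishes; the bucket name is a placeholder, and you should scope resources to your own environment.

    ```python
    import json

    # Identity policy for the user creating the workflow: grants the two
    # CloudTrail read calls named above. (Sketch; scope as appropriate.)
    user_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["cloudtrail:DescribeTrails", "cloudtrail:LookupEvents"],
                "Resource": "*",
            }
        ],
    }

    # Policy for the workflow role: read access to the hypothetical S3
    # location where CloudTrail delivers logs ("amzn-s3-demo-bucket" is
    # a placeholder bucket name).
    workflow_role_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::amzn-s3-demo-bucket",
                    "arn:aws:s3:::amzn-s3-demo-bucket/AWSLogs/*",
                ],
            }
        ],
    }

    print(json.dumps(user_policy, indent=2))
    ```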

  5. For the Database snapshot blueprint type, optionally identify a subset of data to import by specifying one or more exclude patterns. For more information, see Specifying What Data Is Imported.

    For the Incremental database blueprint type, specify tables to import along with bookmark columns to determine previously imported data. For more information, see Tracking Processed Data Using Job Bookmarks in the AWS Glue Developer Guide.
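    Conceptually, a bookmark column works like this sketch (plain Python for illustration, not AWS Glue code): the workflow records the highest bookmark value it has loaded, and the next run imports only rows beyond that value.

    ```python
    # Hypothetical source rows with "updated_at" as the bookmark column.
    rows = [
        {"id": 1, "updated_at": "2024-01-01"},
        {"id": 2, "updated_at": "2024-02-01"},
        {"id": 3, "updated_at": "2024-03-01"},
    ]

    # Bookmark value recorded by the previous workflow run.
    last_bookmark = "2024-01-15"

    # An incremental load picks up only rows past the bookmark.
    new_rows = [r for r in rows if r["updated_at"] > last_bookmark]
    print([r["id"] for r in new_rows])  # [2, 3]
    ```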

  6. Under Import target, specify the target database, target Amazon S3 location, and data format.

    Ensure that the workflow role has the required Lake Formation permissions on the database and Amazon S3 target location.
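    One way to grant those Lake Formation permissions programmatically is the GrantPermissions API. The sketch below builds a request you might pass to boto3's lakeformation.grant_permissions(**request); the account ID, role name, and database name are placeholders.

    ```python
    # Request payload for lakeformation.grant_permissions(**request),
    # letting the workflow role create and manage tables in the target
    # database. All identifiers below are hypothetical.
    request = {
        "Principal": {
            "DataLakePrincipalIdentifier": (
                "arn:aws:iam::111122223333:role/LakeFormationWorkflowRole"
            )
        },
        "Resource": {"Database": {"Name": "target_database"}},
        "Permissions": ["CREATE_TABLE", "ALTER", "DROP"],
    }
    print(request["Permissions"])
    ```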

  7. Choose an import frequency.

    You can specify a cron expression with the Custom option.
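    AWS Glue schedules use a six-field cron syntax: cron(Minutes Hours Day-of-month Month Day-of-week Year). A small sketch (the helper name is hypothetical, for illustration):

    ```python
    def glue_cron(minutes, hours, day_of_month, month, day_of_week, year):
        """Format a schedule in the six-field cron syntax that AWS Glue
        uses. (Hypothetical helper, shown for illustration.)"""
        return f"cron({minutes} {hours} {day_of_month} {month} {day_of_week} {year})"

    # Run the workflow every day at 01:00 UTC. One of day-of-month or
    # day-of-week must be "?" in this syntax.
    daily = glue_cron("0", "1", "*", "*", "?", "*")
    print(daily)  # cron(0 1 * * ? *)
    ```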

  8. Under Import options:

    1. Enter a workflow name.

    2. For Role, choose the role that you created in Create an IAM Role for Workflows.

    3. Optionally specify a table prefix. The prefix is prepended to the names of Data Catalog tables that the workflow creates.
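    For illustration, prepending works as straight concatenation; the prefix and table names below are hypothetical.

    ```python
    # A table prefix of "sales_" applied to table names the workflow
    # derives from the source (all names are placeholders).
    prefix = "sales_"
    source_tables = ["orders", "customers"]
    catalog_tables = [prefix + name for name in source_tables]
    print(catalog_tables)  # ['sales_orders', 'sales_customers']
    ```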

  9. Choose Create, and wait for the console to report that the workflow was successfully created.

    Tip

    Did you get the following error message?

    User: arn:aws:iam::account-id:user/username is not authorized to perform: iam:PassRole on resource:arn:aws:iam::account-id:role/rolename...

    If so, check that you replaced account-id in the user's iam:PassRole policy with a valid AWS account number.
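    A correct iam:PassRole statement might look like the following sketch; the account ID and role name are placeholders that you would replace with your own values.

    ```python
    import json

    # Statement allowing the user to pass the workflow role to the
    # service. "111122223333" and the role name are placeholders.
    pass_role_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": (
                    "arn:aws:iam::111122223333:role/LakeFormationWorkflowRole"
                ),
            }
        ],
    }
    print(json.dumps(pass_role_policy, indent=2))
    ```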