Supported connections for data sources and outputs

This section describes connections for inputs to and outputs from a DataBrew recipe job.

You can connect to the following data sources for DataBrew recipe jobs. These include any source of data that isn't a file you're uploading directly to DataBrew. The data source that you're using might be called a database, a data warehouse, or something else. However, for the purpose of this documentation, we refer to all data providers as data sources or connections.

You can create a dataset using any of these data sources. For the output of DataBrew recipe jobs, you can use only the first three: Amazon S3, the AWS Glue Data Catalog, or JDBC databases supported through Amazon RDS. Amazon AppFlow and AWS Data Exchange aren't supported as output destinations for recipe jobs. (A sketch of configuring job output programmatically appears after the list that follows.)

  • Amazon S3 – You can use S3 to store and protect any amount of data. To create a dataset, you specify an S3 URL where DataBrew can access a data file, for example: s3://your-bucket-name/inventory-data.csv

    DataBrew can also read all of the files in an S3 folder, which means that you can create a dataset that spans multiple files. To do this, specify an S3 URL in the form s3://your-bucket-name/your-folder-name/. (A sketch of creating datasets programmatically, from S3 and the other supported sources, appears after this list.)

    DataBrew supports only the following Amazon S3 storage classes: Standard, Reduced Redundancy, Standard-IA, and S3 One Zone-IA. DataBrew ignores files with other storage classes. DataBrew also ignores empty files (files containing 0 bytes). For more information about Amazon S3 storage classes, see Using Amazon S3 storage classes in the Amazon S3 Console User Guide.

  • AWS Glue Data Catalog – You can use the Data Catalog to define references to data that's stored in the AWS Cloud. With the Data Catalog, you can build connections to individual tables in the following services:

    • Amazon Redshift

    • Amazon Aurora MySQL-Compatible Edition

    • Aurora PostgreSQL-Compatible Edition

    • Amazon RDS for MySQL

    • Amazon RDS for PostgreSQL

    DataBrew recognizes all AWS Lake Formation permissions that have been applied to Data Catalog resources, so DataBrew users can access those resources only if they're authorized to do so.

    To create a dataset, you specify a Data Catalog database name and a table name. DataBrew takes care of the other connection details.

  • Data connected using drivers, for example JDBC – You can create a dataset by connecting to data with a supported JDBC driver. For more information, see Using drivers with AWS Glue DataBrew.

    DataBrew officially supports the following data sources using Java Database Connectivity (JDBC):

    • Microsoft SQL Server

    • MySQL

    • Oracle

    • PostgreSQL

    • Amazon Redshift

    • Snowflake Connector for Spark

    The data sources can be located anywhere that you can connect to them from DataBrew. This list includes only JDBC connections that we've tested and can therefore support.

    To connect to data that requires an unlisted JDBC driver, make sure that the driver is compatible with JDK 8. Store the driver in an S3 bucket that your IAM role for DataBrew can access, and then point your dataset at the driver file. For more information, see Using drivers with AWS Glue DataBrew.

  • Amazon AppFlow

    Amazon AppFlow for DataBrew datasets is in preview release and is subject to change.

    Using Amazon AppFlow, you can transfer data into Amazon S3 from third-party Software-as-a-Service (SaaS) applications such as Salesforce, Zendesk, Slack, and ServiceNow. You can then use the data to create a DataBrew dataset.

    In Amazon AppFlow, you create a connection and a flow to transfer data between your third-party application and a destination application. When using Amazon AppFlow with DataBrew, make sure that the Amazon AppFlow destination application is Amazon S3. Amazon AppFlow destination applications other than Amazon S3 don't appear in the DataBrew console. For more information on transferring data from your third-party application and creating Amazon AppFlow connections and flows, see the Amazon AppFlow documentation.

    When you choose Connect new dataset on the Datasets tab of DataBrew and then choose Amazon AppFlow, you see all flows in Amazon AppFlow that are configured with Amazon S3 as the destination application. To use a flow's data for your dataset, choose that flow (a programmatic sketch appears after this list).

    Choosing Create flow, Manage flows, or View details for Amazon AppFlow in the DataBrew console opens the Amazon AppFlow console so that you can perform those tasks.

    The following situations can arise when you select an Amazon AppFlow flow in the DataBrew console to create a dataset:

    • Data hasn't been aggregated – If the flow trigger is Run on demand, or is Run on schedule with full data transfer, aggregate the data for the flow before using it to create a DataBrew dataset. Aggregating the flow combines all of the flow's records into a single file. Flows with the trigger type Run on schedule with incremental data transfer, or Run on event, don't require aggregation. To aggregate data in Amazon AppFlow, choose Edit flow configuration > Destination details > Additional settings > Data transfer preference.

    • Flow hasn't been run – If the run status for a flow is empty, one of the following applies:

      • If the trigger for running the flow is Run on demand, the flow has not yet been run.

      • If the trigger for running the flow is Run on event, the triggering event has not yet occurred.

      • If the trigger for running the flow is Run on schedule, a scheduled run has not yet occurred.

      Before creating a dataset with a flow, choose Run flow for that flow.

      For more information, see Amazon AppFlow flows in the Amazon AppFlow User Guide.

  • AWS Data Exchange – You can choose from hundreds of third-party data sources that are available in AWS Data Exchange. By subscribing to these data sources, you get the most up-to-date version of the data.

    To create a dataset, you specify the name of a Data Exchange data product that you're subscribed to and entitled to use.
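
The console handles all of these connections interactively, but datasets can also be created programmatically. The following sketch uses the AWS SDK for Python (Boto3) to create datasets from an Amazon S3 file, a Data Catalog table, and a JDBC source reached through an AWS Glue connection. The bucket, database, table, and connection names are placeholders; treat the parameter shapes as a starting point and confirm them against the current CreateDataset API reference.

import boto3

# All names below are placeholders; substitute your own resources.
databrew = boto3.client("databrew", region_name="us-east-1")

# Dataset backed by a single file in Amazon S3.
databrew.create_dataset(
    Name="inventory-from-s3",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "your-bucket-name",
            "Key": "inventory-data.csv",
        }
    },
)

# Dataset backed by a table defined in the AWS Glue Data Catalog.
databrew.create_dataset(
    Name="sales-from-data-catalog",
    Input={
        "DataCatalogInputDefinition": {
            "DatabaseName": "your-catalog-database",
            "TableName": "your-catalog-table",
        }
    },
)

# Dataset backed by a JDBC source, reached through an AWS Glue connection.
databrew.create_dataset(
    Name="orders-from-jdbc",
    Input={
        "DatabaseInputDefinition": {
            "GlueConnectionName": "your-glue-jdbc-connection",
            "DatabaseTableName": "public.orders",
        }
    },
)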
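
For Amazon AppFlow, a dataset can reference a flow whose destination is Amazon S3. The sketch below assumes that the CreateDataset input carries the flow's ARN in a Metadata.SourceArn field alongside the flow's S3 destination; because Amazon AppFlow support is in preview, verify this shape against the current API reference before relying on it. The flow name, bucket, prefix, and account ID are placeholders.

import boto3

databrew = boto3.client("databrew")

# Placeholder values; the flow must already exist in Amazon AppFlow and
# must use Amazon S3 as its destination application.
flow_arn = "arn:aws:appflow:us-east-1:111122223333:flow/your-flow-name"

databrew.create_dataset(
    Name="salesforce-from-appflow",
    Input={
        # Assumption: the S3 location is the flow's destination bucket and prefix.
        "S3InputDefinition": {
            "Bucket": "your-appflow-destination-bucket",
            "Key": "your-flow-output-prefix/",
        },
        # Assumption: SourceArn ties the dataset back to the AppFlow flow.
        "Metadata": {"SourceArn": flow_arn},
    },
)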
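
For job output, a recipe job lists one or more output locations. The following sketch writes recipe-job results to Amazon S3 and then starts the job; the dataset, recipe, role, and bucket names are placeholders. Comparable parameters exist for Data Catalog and JDBC (Amazon RDS) targets, so check the CreateRecipeJob API reference for those shapes.

import boto3

databrew = boto3.client("databrew")

# Placeholder names; the dataset, published recipe, and IAM role must already exist.
databrew.create_recipe_job(
    Name="inventory-cleanup-job",
    DatasetName="inventory-from-s3",
    RecipeReference={
        "Name": "inventory-cleanup-recipe",
        "RecipeVersion": "LATEST_PUBLISHED",
    },
    RoleArn="arn:aws:iam::111122223333:role/DataBrewServiceRole",
    Outputs=[
        {
            "Location": {
                "Bucket": "your-output-bucket",
                "Key": "cleaned/inventory/",
            },
            "Format": "CSV",
        }
    ],
)

# Start the job; results are written under the S3 prefix given above.
databrew.start_job_run(Name="inventory-cleanup-job")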