Using Data from an Amazon RDS Database to Create an Amazon ML Datasource - Amazon Machine Learning

We are no longer updating the Amazon Machine Learning service or accepting new users for it. This documentation is available for existing users, but we are no longer updating it. For more information, see What is Amazon Machine Learning.

Using Data from an Amazon RDS Database to Create an Amazon ML Datasource

Amazon ML allows you to create a datasource object from data stored in a MySQL database in Amazon Relational Database Service (Amazon RDS). When you perform this action, Amazon ML creates an AWS Data Pipeline object that executes the SQL query that you specify, and places the output into an S3 bucket of your choice. Amazon ML uses that data to create the datasource.


Amazon ML supports only MySQL databases in VPCs.

Before Amazon ML can read your input data, you must export that data to Amazon Simple Storage Service (Amazon S3). You can set up Amazon ML to perform the export for you by using the API. (RDS is limited to the API, and is not available from the console.)

In order for Amazon ML to connect to your MySQL database in Amazon RDS and read data on your behalf, you need to provide the following:

  • The RDS DB instance identifier

  • The MySQL database name

  • The AWS Identity and Access Management (IAM) role that is used to create, activate, and execute the data pipeline

  • The database user credentials:

    • User name

    • Password

  • The AWS Data Pipeline security information:

    • The IAM resource role

    • The IAM service role

  • The Amazon RDS security information:

    • The subnet ID

    • The security group IDs

  • The SQL query that specifies the data that you want to use to create the datasource

  • The S3 output location (bucket) used to store the results of the query

  • (Optional) The location of the data schema file

Additionally, you need to ensure that the IAM users or roles that create Amazon RDS datasources by using the CreateDataSourceFromRDS operation have the iam:PassRole permission. For more information, see Controlling Access to Amazon ML Resources -with IAM.

RDS Database Instance Identifier

The RDS DB instance identifier is a unique name that you supply that identifies the database instance that Amazon ML should use when interacting with Amazon RDS. You can find the RDS DB instance identifier in the Amazon RDS console.

MySQL Database Name

MySQL Database Name specifies the name of the MySQL database in the RDS DB instance.

IAM Role Used to Create, Activate, and Execute the Data Pipeline

This parameter specifies the IAM role that Amazon ML assumes on behalf of the user to create and activate a data pipeline in the user's account and copy data (the result of the SQL query) from Amazon RDS to Amazon S3. Amazon ML then creates a datasource based on the data in Amazon S3.

An IAM role has two parts: a permissions policy (or policies) that states the permissions given to the role, and a trust policy that states who can assume the role.

The following permissions policy defines a role that gives permission to describe the RDS DB instances; describe the database security name; describe, create, modify and activate the data pipelines; and pass the resource and service roles to AWS Data Pipeline. Additionally, it allows Amazon ML to list the contents of the S3 bucket where data will be output. This bucket name must match the bucket name that is passed as a part of the Amazon S3 output location parameter. Finally, the policy grants the GetObject permission for the Amazon S3 path where the data will be output. For example, if you specify "s3://examplebucket/output/path" as the output staging location, you will need to grant the s3:ListBucket permission to the "arn:aws:s3:::examplebucket" resource, and the s3:GetObject permission to the "arn:aws:s3:::examplebucket/output/path/*" resource.

{ "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": [ "rds:DescribeDBInstances", "rds:DescribeDBSecurityGroups", "datapipeline:DescribeObjects", "datapipeline:CreatePipeline", "datapipeline:PutPipelineDefinition", "datapipeline:ActivatePipeline", "datapipeline:DescribePipelines", "datapipeline:QueryObjects", "iam:PassRole" ], "Resource": [ "*" ] }, { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::examplebucket" ] }, { "Effect": "Allow", "Action": [ "s3:GetObject" ], "Resource": [ "arn:aws:s3:::examplebucket/output/path/*" ] } ]}

The following trust policy allows Amazon ML to assume the role defined in the preceding example:

{ "Version": "2008-10-17", "Statement": [{ "Sid": "", "Effect": "Allow", "Principal": { "Service": "" }, "Action": "sts:AssumeRole" } ]}

Amazon ML creates a new pipeline every time you use Amazon RDS data to create a datasource, but does not delete the pipeline after the data has been imported. AWS Data Pipeline Limits apply to these pipelines, so we recommend that you delete Amazon ML pipelines regularly.

Database User Credentials

To connect to the RDS DB instance, you must supply the user name and password of the database user who has sufficient permissions to execute the SQL query that you provide.

AWS Data Pipeline Security Information

To enable secure AWS Data Pipeline access, you must provide the names of the IAM resource role and the IAM service role.

An EC2 instance assumes the resource role to copy data from Amazon RDS to Amazon S3. The easiest way to create this resource role is by using the DataPipelineDefaultResourceRole template, and listing as a trusted service. For more information about the template, see Setting Up IAM roles in the AWS Data Pipeline Developer Guide.

AWS Data Pipeline assumes the service role to monitor the progress of copying data from Amazon RDS to Amazon S3. The easiest way to create this resource role is by using the DataPipelineDefaultRole template, and listing as a trusted service. For more information about the template, see Setting Up IAM roles in the AWS Data Pipeline Developer Guide.

Amazon RDS Security Information

To enable secure Amazon RDS access, you need to provide the VPC Subnet ID and RDS Security Group IDs. You also need to set up appropriate ingress rules for the VPC subnet that is pointed at by the Subnet ID parameter, and provide the ID of the security group that has this permission.


The MySQL SQL Query parameter specifies the SQL SELECT query that you want to execute on your MySQL database. The results of the query is copied to the S3 output location (bucket) that you specify.


Machine learning technology works best when input records are presented in random order (shuffled). You can easily shuffle the results of your MySQL SQL query by using the rand() function. For example, let's say that this is the original query:

"SELECT col1, col2, … FROM training_table"

You can add random shuffling by updating the query like this:

"SELECT col1, col2, … FROM training_table ORDER BY rand()"

S3 Output Location

The S3 Output Location parameter specifies the name of the "staging" Amazon S3 location where the results of the MySQL SQL query is output.


You need to ensure that Amazon ML has permissions to read data from this location once the data is exported from Amazon RDS. For information about setting these permissions, see Granting Amazon ML Permissions to Read Your Data from Amazon S3 .

Configuring an IAM User or Role to Grant Permission to Enable Role Passing

The IAM user or role that will create Amazon ML datasources from Amazon RDS needs permission to pass roles to other services—that is what enables it to pass the IAM role defined in the preceding example. To accomplish this, you need to configure the iam:PassRole permission. The following example policy shows how to configure it to pass any role belonging to the AWS account that makes the request. You can restrict this policy to specific roles if needed:

{ "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": [ "iam:PassRole" ], "Resource": [ "*" ] } ]}