Launching Resources for Your Pipeline into a VPC - AWS Data Pipeline

Launching Resources for Your Pipeline into a VPC

Pipelines launch Amazon EC2 instances and Amazon EMR clusters into an Amazon Virtual Private Cloud (Amazon VPC). Each AWS account created after December 4, 2013 has a default VPC in each region. The default configuration of the default VPC supports AWS Data Pipeline resources. You can use this VPC or create custom VPCs. For production environments, we recommend that you create custom VPCs because they give you greater control over network configuration. For more information, see the Amazon VPC User Guide.

Important

If your AWS account was created before December 4, 2013, you might have the option to create Ec2Resource objects for a pipeline in an EC2-Classic network rather than a VPC. We strongly recommend that you create resources for all your pipelines in VPCs. In addition, if you have existing resources in EC2-Classic, we strongly recommend that you migrate them to a VPC. For more information, see Migrating EC2Resource Objects in a Pipeline from EC2-Classic to VPCs.

The steps to configure a VPC that AWS Data Pipeline can use are listed below:

  • First, create a VPC and subnets using Amazon VPC. Configure the VPC so that instances in the VPC can access the AWS Data Pipeline endpoint and Amazon S3.

  • Next, set up a security group that grants Task Runner access to your data sources.

  • Finally, specify a subnet from the VPC when you configure your instances and clusters and when you create your data sources.

For more information about VPCs, see the Amazon VPC User Guide.

Create and Configure a VPC

A VPC that you create must have a subnet, an internet gateway, and a route table for the subnet with a route to the internet gateway so that instances in the VPC can access Amazon S3. If you have a default VPC, it is already configured this way. The easiest way to create and configure your VPC is to use the VPC wizard, as shown in the following procedure.

To create and configure your VPC using the VPC wizard

  1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

  2. From the navigation bar, use the region selector to select the region for your VPC. You launch all instances and clusters into this VPC, so select the region that makes sense for your pipeline.

  3. Choose VPC Dashboard from the navigation pane and then choose Launch VPC Wizard.

  4. Select the first option, VPC with a Single Public Subnet Only, and then click Select.

  5. The configuration page shows the CIDR ranges and settings that you've chosen. Verify that Enable DNS hostnames is Yes. Make any other changes that you need, and then click Create VPC to create your VPC, subnet, internet gateway, and route table.

  6. After the VPC is created, choose Your VPCs in the navigation pane, and then select your VPC from the list to verify settings.

  7. On the Description tab, make sure that both DNS resolution and DNS hostnames are Enabled. For more information about DNS settings and updating DNS support for a VPC, see Using DNS in the Amazon VPC User Guide.

  8. On the Description tab, beside DHCP options set, choose the identifier to open the DHCP options set to verify the configuration.

    The list of DHCP options sets opens with your options set selected.

  9. On the Details tab, next to Options, verify the following:

    • domain-name is set to ec2.internal for the US East (N. Virginia) Region, or region.compute.internal for all other regions (for example, us-west-2.compute.internal for US West (Oregon)).

    • domain-name-servers is set to AmazonProvidedDNS.

    For more information, see DHCP options sets in the Amazon VPC User Guide.
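The two DHCP values checked in the last step follow a simple rule, which can be sketched as below. This is a minimal illustration rather than part of AWS Data Pipeline: the helper names `expected_domain_name` and `dhcp_options_ok` are hypothetical, and the flat `options` dict is a simplification, not the shape that the EC2 API returns.

```python
def expected_domain_name(region: str) -> str:
    """Return the domain-name value the procedure above says to expect."""
    # US East (N. Virginia), region code us-east-1, is the one exception.
    if region == "us-east-1":
        return "ec2.internal"
    return f"{region}.compute.internal"


def dhcp_options_ok(region: str, options: dict) -> bool:
    """Check a flat dict of DHCP options against both expected values."""
    return (
        options.get("domain-name") == expected_domain_name(region)
        and options.get("domain-name-servers") == "AmazonProvidedDNS"
    )


print(expected_domain_name("us-west-2"))  # us-west-2.compute.internal
```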

As an alternative to using the Amazon VPC wizard, you can create a VPC, subnet, internet gateway, and route table manually. For more information, see Creating a VPC and Adding an Internet Gateway to Your VPC in the Amazon VPC User Guide.
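The manual route can also be scripted. The following is a rough sketch, assuming that `ec2` is a boto3 EC2 client (for example, `boto3.client("ec2", region_name="us-west-2")`). The function name `create_pipeline_vpc` and the CIDR ranges are illustrative choices and do not come from this guide.

```python
def create_pipeline_vpc(ec2, vpc_cidr="10.0.0.0/16", subnet_cidr="10.0.0.0/24"):
    """Create a VPC, one subnet, an internet gateway, and a route table.

    `ec2` is assumed to be a boto3 EC2 client. The CIDR defaults are
    placeholders; choose ranges that suit your network.
    """
    vpc_id = ec2.create_vpc(CidrBlock=vpc_cidr)["Vpc"]["VpcId"]

    # DNS resolution and DNS hostnames must both be enabled.
    # Each attribute is set in its own call.
    ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
    ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})

    subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock=subnet_cidr)
    subnet_id = subnet["Subnet"]["SubnetId"]

    igw = ec2.create_internet_gateway()
    igw_id = igw["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    # Route all non-local traffic from the subnet to the internet gateway
    # so that instances can reach Amazon S3.
    rtb_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
    ec2.create_route(
        RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id
    )
    ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)

    return {"vpcId": vpc_id, "subnetId": subnet_id}
```

The returned subnet ID is the value that you later supply in the subnetId field of a pipeline resource.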

Set up Connectivity Between Resources

Security groups act as a virtual firewall for your instances to control inbound and outbound traffic. You must grant Task Runner access to your data sources.

For more information about security groups, see Security Groups for Your VPC in the Amazon VPC User Guide.

First, identify the security group or IP address used by the resource running Task Runner.

  • If your resource is of type EmrCluster, Task Runner runs on the cluster by default. We create security groups named ElasticMapReduce-master and ElasticMapReduce-slave when you launch the cluster. You need the IDs of these security groups later on.

    To get the IDs of the security groups for a cluster in a VPC

    1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

    2. In the navigation pane, click Security Groups.

    3. If you have a lengthy list of security groups, you can click the Name column to sort your security groups by name. If you don't see a Name column, click the Show/Hide Columns icon, and then click Name.

    4. Note the IDs of the ElasticMapReduce-master and ElasticMapReduce-slave security groups.

  • If your resource is of type Ec2Resource, Task Runner runs on the EC2 instance by default. Create a security group for the VPC and specify it when you launch the EC2 instance. You need the ID of this security group later on.

    To create a security group for an EC2 instance in a VPC

    1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

    2. In the navigation pane, click Security Groups.

    3. Click Create Security Group.

    4. Specify a name and description for the security group.

    5. Select your VPC from the list, and then click Create.

    6. Note the ID of the new security group.

  • If you are running Task Runner on your own computer, note its public IP address, in CIDR notation. If the computer is behind a firewall, note the entire address range of its network. You need this address later on.
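If you run Task Runner on your own computer, the address you note must be in CIDR notation. Python's standard `ipaddress` module can produce it, as in this small sketch. The helper names are hypothetical and the addresses come from a documentation-only range.

```python
import ipaddress


def host_cidr(ip: str) -> str:
    """Express a single address as a one-address CIDR block."""
    addr = ipaddress.ip_address(ip)
    return f"{addr}/{addr.max_prefixlen}"


def network_cidr(ip: str, prefix_len: int) -> str:
    """Express the whole network that contains `ip` as a CIDR block."""
    # strict=False accepts a host address and returns its enclosing network.
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


print(host_cidr("203.0.113.25"))         # 203.0.113.25/32
print(network_cidr("203.0.113.25", 24))  # 203.0.113.0/24
```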

Next, create rules in the security groups of the data sources that Task Runner must access. Each rule must allow inbound traffic from the resource running Task Runner. For example, if Task Runner must access an Amazon Redshift cluster, the security group for the Amazon Redshift cluster must allow inbound traffic from the resource.

To add a rule to the security group for an Amazon RDS database

  1. Open the Amazon RDS console at https://console.aws.amazon.com/rds/.

  2. In the navigation pane, click Instances.

  3. Click the details icon for the DB instance. Under Security and Network, click the link to the security group, which takes you to the Amazon EC2 console. If you're using the old console design for security groups, switch to the new console design by clicking the icon that's displayed at the top of the console page.

  4. From the Inbound tab, click Edit and then click Add Rule. Specify the database port that you used when you launched the DB instance. Start typing the ID of the security group or IP address used by the resource running Task Runner in Source.

  5. Click Save.

To add a rule to the security group for an Amazon Redshift cluster

  1. Open the Amazon Redshift console at https://console.aws.amazon.com/redshift/.

  2. In the navigation pane, click Clusters.

  3. Click the details icon for the cluster. Under Cluster Properties, note the name or ID of the security group, and then click View VPC Security Groups, which takes you to the Amazon EC2 console. If you're using the old console design for security groups, switch to the new console design by clicking the icon that's displayed at the top of the console page.

  4. Select the security group for the cluster.

  5. From the Inbound tab, click Edit and then click Add Rule. Specify the type, protocol, and port range. Start typing the ID of the security group or IP address used by the resource running Task Runner in Source.

  6. Click Save.
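The inbound rules from the two procedures above can also be added programmatically. The sketch below assumes that `ec2` is a boto3 EC2 client. The helper names `ingress_permission` and `allow_task_runner` are hypothetical. A source that starts with `sg-` is treated as a security group ID, and anything else is treated as a CIDR range.

```python
def ingress_permission(port: int, source: str, protocol: str = "tcp") -> dict:
    """Build one inbound rule for the given port and source."""
    permission = {"IpProtocol": protocol, "FromPort": port, "ToPort": port}
    if source.startswith("sg-"):
        # Source is the security group of the resource running Task Runner.
        permission["UserIdGroupPairs"] = [{"GroupId": source}]
    else:
        # Source is an address or address range in CIDR notation.
        permission["IpRanges"] = [{"CidrIp": source}]
    return permission


def allow_task_runner(ec2, data_source_group_id: str, port: int, sources) -> None:
    """Add inbound rules to the security group of a data source."""
    ec2.authorize_security_group_ingress(
        GroupId=data_source_group_id,
        IpPermissions=[ingress_permission(port, s) for s in sources],
    )
```

For an EmrCluster resource, you would pass the IDs of both the ElasticMapReduce-master and ElasticMapReduce-slave groups as sources. Use the port that your database or cluster was launched with.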

Configure the Resource

To launch a resource into a subnet of a nondefault VPC or a nondefault subnet of a default VPC, you must specify the subnet using the subnetId field when you configure the resource. If you have a default VPC and you don't specify subnetId, we launch the resource into the default subnet of the default VPC.

Example EmrCluster

The following example object launches an Amazon EMR cluster into a nondefault VPC.

{
  "id": "MyEmrCluster",
  "type": "EmrCluster",
  "keyPair": "my-key-pair",
  "masterInstanceType": "m1.xlarge",
  "coreInstanceType": "m1.small",
  "coreInstanceCount": "10",
  "taskInstanceType": "m1.small",
  "taskInstanceCount": "10",
  "subnetId": "subnet-12345678"
}

For more information, see EmrCluster.

Example Ec2Resource

The following example object launches an EC2 instance into a nondefault VPC. Notice that you must specify security groups for an instance in a nondefault VPC using their IDs, not their names.

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "securityGroupIds": "sg-12345678",
  "subnetId": "subnet-1a2b3c4d",
  "associatePublicIpAddress": "true",
  "keyPair": "my-key-pair"
}

For more information, see Ec2Resource.
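If you generate pipeline definitions from code, the subnet rules described in this section can be enforced before you upload the definition. The following sketch builds an Ec2Resource object as a Python dict. The function `ec2_resource` and its parameters are hypothetical, and the `sg-` prefix check is only a heuristic for telling a security group ID from a name.

```python
import json


def ec2_resource(object_id, instance_type, role, resource_role,
                 subnet_id=None, security_group_id=None):
    """Build an Ec2Resource object, adding VPC fields only when given."""
    obj = {
        "id": object_id,
        "type": "Ec2Resource",
        "role": role,
        "resourceRole": resource_role,
        "instanceType": instance_type,
    }
    if security_group_id is not None:
        # Security groups in a nondefault VPC are given by ID, not by name.
        if not security_group_id.startswith("sg-"):
            raise ValueError(f"expected a security group ID, got {security_group_id!r}")
        obj["securityGroupIds"] = security_group_id
    if subnet_id is not None:
        # Without subnetId, the resource launches into the default subnet
        # of the default VPC, if the account has one.
        obj["subnetId"] = subnet_id
    return obj


resource = ec2_resource("MyEC2Resource", "m1.medium", "test-role", "test-role",
                        subnet_id="subnet-1a2b3c4d",
                        security_group_id="sg-12345678")
print(json.dumps(resource, indent=2))
```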

Migrating EC2Resource Objects in a Pipeline from EC2-Classic to VPCs

If you have pipeline resources in EC2-Classic, we recommend that you migrate them to use Amazon VPC. Use the following steps as guidance to migrate resources from EC2-Classic to VPC in a pipeline.

To migrate pipeline resources from EC2-Classic to VPC

  1. Identify the Ec2Resource objects in your pipelines that use EC2-Classic. These objects have a securityGroups property. In contrast, objects created in a VPC have securityGroupIds and subnetId properties.

  2. For each object, make a note of the EC2-Classic securityGroups specified for the object. You will copy each security group's settings to a new VPC security group in the next step.

  3. Follow the steps in Migrate your resources to a VPC in the Amazon EC2 User Guide for Linux Instances. The migration steps have you set up a new VPC and copy security group settings. As you perform these steps, attend to the following:

    • Set up your VPC according to the guidelines in Create and Configure a VPC above.

    • Make a note of the Subnet ID in the VPC that you create. You will use this later when you migrate objects.

    • Make a note of security group IDs as you create them and note the corresponding EC2-Classic security groups. You will need the IDs later when you migrate objects. If you need to create new VPC security groups, see Set up Connectivity Between Resources above for guidelines.

  4. Open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/ and edit resource objects that use EC2-Classic according to the following steps:

    Note

    As an alternative to editing a pipeline directly, you can clone the pipeline and update the clone using the remaining steps in this procedure. To clone a pipeline, select it from the list, and then choose Actions, Clone. You can delete the original pipeline after the cloned pipeline validates and runs successfully.

    1. From the list of pipelines, select the pipeline that contains the object to migrate and then choose Actions, Edit.

    2. In the architect view for the pipeline, select the resource object that you need to migrate from the design pane on the left.

    3. Select the Resources section on the right to view settings for the resource.

    4. Make a note of the Security Groups listed for the resource. These are the EC2-Classic security group or groups that you will replace.

    5. From the Add an optional field... list, select Security Group IDs.

    6. In the Security Group IDs box that appears, type the IDs of the VPC security group or groups that correspond to the EC2-Classic security groups. Use a comma to separate multiple IDs.

    7. From the Add an optional field... list, select Subnet Id, and then type the Subnet Id associated with the VPC you want to use. For example, subnet-12345678.

    8. Choose the delete icon to the right of Security Groups to remove the EC2-Classic Security Groups.

    9. Choose Save.

      AWS Data Pipeline validates the pipeline configuration. If any validation errors occur, details appear in the lower left of the architect view. Address any errors before running the pipeline.

    10. If the pipeline is an on-demand pipeline, the updated definition is used the next time the pipeline runs. If the pipeline is a scheduled pipeline, choose Activate. AWS Data Pipeline uses the updated definition during the next scheduled run.

    Repeat the steps above for each pipeline that uses EC2-Classic resources.
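If you keep pipeline definitions as JSON files, the same edit can be sketched in code: find each Ec2Resource object that has a securityGroups property, then replace that property with securityGroupIds and subnetId. This illustrates the transformation only and is not an AWS-provided tool. The function names and the `group_id_for` mapping (EC2-Classic group name to VPC security group ID) are hypothetical, and writing multiple IDs as a JSON list is an assumption about the definition file format.

```python
def uses_ec2_classic(obj: dict) -> bool:
    """True for an Ec2Resource object that still names EC2-Classic groups."""
    return obj.get("type") == "Ec2Resource" and "securityGroups" in obj


def migrate_object(obj: dict, group_id_for: dict, subnet_id: str) -> dict:
    """Return a copy of `obj` that uses VPC fields instead of securityGroups."""
    if not uses_ec2_classic(obj):
        return dict(obj)
    names = obj["securityGroups"]
    if isinstance(names, str):
        names = [names]
    # A missing mapping raises KeyError, so no group is silently dropped.
    ids = [group_id_for[name] for name in names]
    migrated = {k: v for k, v in obj.items() if k != "securityGroups"}
    # Assumption: a single ID is a string and several IDs are a list.
    migrated["securityGroupIds"] = ids[0] if len(ids) == 1 else ids
    migrated["subnetId"] = subnet_id
    return migrated


def migrate_definition(definition: dict, group_id_for: dict, subnet_id: str) -> dict:
    """Apply migrate_object to every object in a pipeline definition."""
    objects = [migrate_object(o, group_id_for, subnet_id)
               for o in definition.get("objects", [])]
    return {**definition, "objects": objects}
```

Validate the migrated definition in AWS Data Pipeline before you run it, as in the console procedure.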

To confirm that a pipeline's resources migrated to VPC successfully, you can verify that the EC2 instances launched during pipeline execution launched in the VPC.

To confirm the migration to VPC for a pipeline

  1. From the list of pipelines, choose the Pipeline ID to open execution details, and then find the EC2 instance IDs that the pipeline launched.

  2. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  3. In the navigation pane, choose Instances.

  4. For each instance from above, select the Instance ID. On the Description tab, note the value of the VPC field to confirm that the pipeline launched instances into a VPC. Instances launched in EC2-Classic have an empty VPC field.
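You can run the same check against the output of the EC2 `describe_instances` API call. The sketch below assumes that `response` has the shape that boto3 returns from `ec2.describe_instances(InstanceIds=[...])`. The function name `instances_outside_vpc` is hypothetical, and the sample response is invented for illustration.

```python
def instances_outside_vpc(response: dict) -> list:
    """Return the IDs of instances whose VPC field is missing or empty."""
    missing = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if not instance.get("VpcId"):
                missing.append(instance["InstanceId"])
    return missing


# Invented sample data: one instance in a VPC and one without a VPC.
sample = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0aaa", "VpcId": "vpc-12345678"}]},
        {"Instances": [{"InstanceId": "i-0bbb"}]},
    ]
}
print(instances_outside_vpc(sample))  # ['i-0bbb']
```

An empty result means that every instance the pipeline launched is in a VPC.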