Creating Pipelines Using the Console Manually

You can create a pipeline using the AWS Data Pipeline Architect rather than starting with a template. The example pipeline that you create in this section demonstrates using the Architect to create a pipeline that copies files from one Amazon S3 bucket to another on a schedule that you specify.

Prerequisites

You must have an Amazon S3 location where the file that you copy is located and a destination Amazon S3 location to copy the file to. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
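If you prefer to prepare these Amazon S3 locations from a script, the following is a minimal sketch using the AWS SDK for Python (Boto3). The bucket name and object key are placeholders for illustration only, not values that the walkthrough requires.

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket and key; substitute your own names.
    bucket = "my-datapipeline-example-bucket"
    s3.create_bucket(Bucket=bucket)  # outside us-east-1, also pass a CreateBucketConfiguration

    # Upload a small sample file to serve as the copy source.
    s3.put_object(Bucket=bucket, Key="myinputdata/sample.txt", Body=b"hello, data pipeline")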

Create the Pipeline Definition

Complete the initial pipeline creation screen to create the pipeline definition.

To create your pipeline definition

  1. Open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.

  2. Choose Get started now (if this is your first pipeline) or Create new pipeline.

  3. In Name, enter a name for the pipeline (for example, CopyMyS3Data).

  4. In Description, enter a description.

  5. Choose a Source for your pipeline definition. For this walkthrough, choose Build using architect to use the AWS Data Pipeline Architect to design the pipeline. For more information about the Build using a template option, see Creating Pipelines Using Console Templates. For more information about the Import a definition option to specify a pipeline definition file in Amazon S3 or locally, see Pipeline Definition File Syntax.

  6. Under Schedule, leave the default selections.

  7. Under Pipeline Configuration, leave Logging enabled and enter a location in Amazon S3 where log files are saved.

  8. Under Security/Access, leave Default selected for IAM roles.

    Alternatively, if you created your own IAM roles, choose Custom and then select your roles for the Pipeline role and EC2 instance role.

  9. Optionally, under Tags enter tag keys and values to help you identify and categorize the pipeline.

  10. Choose Edit in Architect.
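If you script your pipelines instead of using the console, the equivalent of this screen is a single CreatePipeline call. The following is a minimal Boto3 sketch; the name, uniqueId, and description values are illustrative, based on the example in this walkthrough.

    import boto3

    dp = boto3.client("datapipeline")

    # uniqueId makes the call idempotent if it is retried.
    response = dp.create_pipeline(
        name="CopyMyS3Data",
        uniqueId="copy-my-s3-data-001",
        description="Copies a file between two Amazon S3 locations on a schedule",
    )
    pipeline_id = response["pipelineId"]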

Define an Activity Using the AWS Data Pipeline Architect

The AWS Data Pipeline Architect allows you to select predefined activities to add to a pipeline. As you define activities and the components associated with them, such as data nodes, schedules, and resources, the Architect creates a graphical representation of the pipeline flow. A data pipeline can comprise multiple activities.

In the following procedure, you add and configure a CopyActivity that copies data between two Amazon S3 locations. You specify one Amazon S3 location as the source DataNode from which to copy and another Amazon S3 location as the destination DataNode. You also configure the schedule for the activity to run and the AWS resource that the activity uses to run.

To define a copy activity

  1. Select Add, CopyActivity.

    Under Activities, fields appear for configuring the properties and resources for the copy activity.

  2. Under Activities, configure the activity according to the following guidelines:

    For this parameter... Do this...

    Name

    Enter a name to help you identify the activity, for example, copy-myS3-data.

    Type

    This is configured by default to CopyActivity based on your earlier selection to add a CopyActivity. Leave the default.

    Input

    Select Create new: DataNode from the list. A new data node with a default name of DefaultDataNode1 is created. This is the source data node from which data is copied. You configure the details of this data node later. If you have an existing data node, you can select that.

    Output

    Select Create new: DataNode from the list. A new data node with a default name of DefaultDataNode2 is created. This is the destination data node to which data is copied. You configure the details of this data node later. If you have an existing data node, you can select that.

    Schedule

    Select Create new: Schedule from the list. A new schedule with a default name of DefaultSchedule1 is created. This schedule determines when the pipeline runs. You configure the details of this schedule later. If you have an existing schedule, you can select that.

    Add an optional field...

    Select Runs On from the list.

    An empty list appears for the new Runs On selection.

    From this list, select Create new: Resource. A resource with a default name of DefaultResource1 is created. This is the AWS resource that the pipeline uses to run the activity. You configure the details of the resource later. If you have an existing resource, you can select that.

    The left pane graphically depicts the activity you configured. You can choose any of the pipeline components in this pane or expand each section in the right pane to view details and perform the following configuration tasks.
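For comparison, the activity that you just configured corresponds to a CopyActivity pipeline object. The following is a minimal sketch in the Boto3 pipelineObjects format; the object IDs are illustrative choices that match the names used in the remaining steps of this walkthrough, not IDs generated by the console.

    # Sketch of the copy activity; each refValue must match the id of the
    # data node, schedule, or resource object defined in the later steps.
    copy_activity = {
        "id": "CopyMyS3DataActivity",
        "name": "copy-myS3-data",
        "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "S3LocationForCopyActivityInput"},
            {"key": "output", "refValue": "S3LocationForCopyActivityOutput"},
            {"key": "schedule", "refValue": "CopyMyS3DataSchedule"},
            {"key": "runsOn", "refValue": "Ec2InstanceForCopyActivity"},
        ],
    }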

Configure the Schedule

Configure the date and time for your pipeline to run. AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.

To configure the date and time for your pipeline to run

  1. On the pipeline page, in the right pane, choose Schedules.

  2. Enter a schedule name for this activity, for example, copy-myS3-data-schedule.

  3. In Start Date Time, select the date from the calendar, and then enter the time to start the activity.

  4. In Period, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).

  5. (Optional) To specify the date and time to end the activity, in Add an optional field, select End Date Time, and enter the date and time.

    To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means that you don't have to wait for the first scheduled period to elapse before AWS Data Pipeline launches its first run.
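Expressed in the same sketch format as the activity, the schedule object might look like the following. The start date and period are placeholders, and the startDateTime string uses the UTC format described above.

    # Sketch of the schedule object. A startDateTime in the past triggers
    # immediate backfill runs, as described above.
    schedule = {
        "id": "CopyMyS3DataSchedule",
        "name": "copy-myS3-data-schedule",
        "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "startDateTime", "stringValue": "2024-01-01T00:00:00"},
            {"key": "period", "stringValue": "1 days"},  # "<number> <units>", here one run per day
            # {"key": "endDateTime", "stringValue": "2024-02-01T00:00:00"},  # optional end
        ],
    }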

Configure Data Nodes

In this step, you configure the data nodes that you created and specified as Input and Output when you configured the copy activity. After you create the data nodes, other activities that you might add to the pipeline can also use them.

To configure the input and output data nodes

  1. On the pipeline page, in the right pane, choose DataNodes or choose the individual data node from the workflow in the left pane.

  2. Configure each data node according to the following guidelines.

    For this parameter... Do this...

    Name

    Enter a name that helps you identify this node's purpose. For example, replace DefaultDataNode1 with S3LocationForCopyActivityInput and DefaultDataNode2 with S3LocationForCopyActivityOutput.

    Type

    Select S3DataNode.

    Schedule

    Select the schedule that you configured in the previous step.

    Add an optional field...

    Select File Path from the list.

    An empty list appears for a new File Path selection.

    Enter an existing file path in Amazon S3 appropriate for the data node that you're configuring. For example, if you are configuring the data node specified as the Input data node for the copy activity, you might enter s3://mybucket/myinputdata; if you are configuring the Output data node, you might enter s3://mybucket/mycopy.
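Continuing the sketch, the two data nodes are S3DataNode objects. The file paths are the example paths above, and the IDs are illustrative, matching the names suggested in the table.

    # Sketches of the input and output S3DataNode objects.
    input_node = {
        "id": "S3LocationForCopyActivityInput",
        "name": "S3LocationForCopyActivityInput",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "schedule", "refValue": "CopyMyS3DataSchedule"},
            {"key": "filePath", "stringValue": "s3://mybucket/myinputdata"},
        ],
    }

    output_node = {
        "id": "S3LocationForCopyActivityOutput",
        "name": "S3LocationForCopyActivityOutput",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "schedule", "refValue": "CopyMyS3DataSchedule"},
            {"key": "filePath", "stringValue": "s3://mybucket/mycopy"},
        ],
    }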

Configure Resources

In this step, you configure the resource that AWS Data Pipeline uses to perform the copy activity, which you specified as the Runs On resource when you configured the activity. The copy activity uses an Amazon EC2 instance.

To configure an EC2 instance as the resource for your pipeline copy activity

  1. On the pipeline page, in the right pane, choose Resources.

  2. Configure the resource according to the following guidelines.

    For this parameter... Do this...

    Name

    Enter a name for the resource that helps you identify it, for example, Ec2InstanceForCopyActivity.

    Type

    Select Ec2Resource.

    Resource Role

    Leave the default DataPipelineDefaultResourceRole selected, or select a custom IAM role. For more information, see IAM Roles for AWS Data Pipeline and IAM Roles in the IAM User Guide.

    Schedule

    Make sure that the schedule that you created above is selected.

    Role

    Leave the default DataPipelineDefaultRole selected, or select a custom IAM role. For more information, see IAM Roles for AWS Data Pipeline and IAM Roles in the IAM User Guide.

    Add an optional field...

    Choose Subnet ID from the list.

    A new, empty Subnet Id field appears.

    Enter the subnet ID of a subnet in the VPC that you are using, for example, subnet-1a2bcd34.
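The resource configured above corresponds to an Ec2Resource object; a minimal sketch follows. The subnet ID is the example placeholder from the table, and terminateAfter is an optional field that the walkthrough does not set.

    # Sketch of the Ec2Resource object that runs the copy activity.
    ec2_resource = {
        "id": "Ec2InstanceForCopyActivity",
        "name": "Ec2InstanceForCopyActivity",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "schedule", "refValue": "CopyMyS3DataSchedule"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "subnetId", "stringValue": "subnet-1a2bcd34"},  # placeholder
            {"key": "terminateAfter", "stringValue": "2 Hours"},    # optional safety timeout
        ],
    }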

Validate and Save the Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline generates validation errors and warnings. Warning messages are informational only, but you must fix any error messages before you can activate your pipeline.

To save and validate your pipeline

  1. Choose Save pipeline.

  2. AWS Data Pipeline validates your pipeline definition and returns a success message, or error and warning messages. If you get an error message, choose Close and then, in the right pane, choose Errors/Warnings.

  3. The Errors/Warnings pane lists the objects that failed validation. Choose the plus (+) sign next to the object names and look for an error message in red.

  4. When you see an error message, go to the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, go to the DataNodes pane to fix the error.

  5. After you fix the errors listed in the Errors/Warnings pane, choose Save Pipeline.

  6. Repeat the process until your pipeline validates successfully.
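From a script, saving and validating correspond to PutPipelineDefinition, which stores the definition and returns any validation errors and warnings. The following sketch continues the earlier examples and also includes a Default object for pipeline-wide settings (such as the Amazon S3 log location) that the console writes for you; the log URI is a placeholder.

    # Pipeline-wide defaults that the console normally fills in for you.
    default_object = {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "pipelineLogUri", "stringValue": "s3://mybucket/pipeline-logs/"},  # placeholder
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ],
    }

    objects = [default_object, copy_activity, schedule,
               input_node, output_node, ec2_resource]

    # Store the definition; the response carries the same errors and warnings
    # that the Errors/Warnings pane shows in the console.
    result = dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)

    for warning in result.get("validationWarnings", []):
        print("WARNING:", warning)   # informational only
    for error in result.get("validationErrors", []):
        print("ERROR:", error)       # must be fixed before activation

    if result["errored"]:
        raise SystemExit("Fix the errors, then put the definition again.")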

Activate the Pipeline

Activate your pipeline to start creating and processing runs. The pipeline starts based on the schedule and period in your pipeline definition.

Important

If activation succeeds, your pipeline is running and might incur usage charges. For more information, see AWS Data Pipeline pricing. To stop incurring usage charges for AWS Data Pipeline, delete your pipeline.

To activate your pipeline

  1. Choose Activate.

  2. In the confirmation dialog box, choose Close.
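Scripted, activation is a single call on the same Boto3 client, assuming the pipeline_id and definition from the earlier sketches.

    # Activate the pipeline; runs begin according to the schedule object,
    # including any backfill runs for a start date in the past.
    dp.activate_pipeline(pipelineId=pipeline_id)

    # Optionally confirm that the pipeline is visible with its definition applied.
    description = dp.describe_pipelines(pipelineIds=[pipeline_id])
    print(description["pipelineDescriptionList"][0]["name"])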