AWS Data Pipeline
Developer Guide (API Version 2012-10-29)

What is AWS Data Pipeline?

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.

For example, you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon Elastic MapReduce (Amazon EMR) cluster over those logs to generate traffic reports.

In this example, AWS Data Pipeline would schedule the daily tasks to copy data and the weekly task to launch the Amazon EMR cluster. AWS Data Pipeline would also ensure that Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its analysis, even if there is an unforeseen delay in uploading the logs.
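The dependency rule in this example can be illustrated with a small sketch. This is not how AWS Data Pipeline is implemented; it only models the idea that the weekly task is held back until every daily upload for that week has succeeded. The function names and dates are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta


def days_in_week(week_end: date) -> list[date]:
    """Return the seven daily partitions ending on week_end (inclusive)."""
    return [week_end - timedelta(days=offset) for offset in range(6, -1, -1)]


def weekly_report_ready(week_end: date, uploaded: set[date]) -> bool:
    """The weekly task may start only once all seven daily uploads exist."""
    return all(day in uploaded for day in days_in_week(week_end))


week_end = date(2013, 1, 6)

# Six of the seven daily uploads have finished; the final day is delayed.
uploaded = set(days_in_week(week_end)[:-1])
print(weekly_report_ready(week_end, uploaded))  # False

# Once the delayed upload arrives, the weekly task is allowed to run.
uploaded.add(week_end)
print(weekly_report_ready(week_end, uploaded))  # True
```

The point of the sketch is that readiness depends on the data being present, not on the clock: a late upload delays the weekly task rather than causing it to run on incomplete input.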

AWS Data Pipeline handles the ambiguities of real-world data management. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up.

AWS Data Pipeline In Action

The following video shows AWS Data Pipeline in action.

Accessing AWS Data Pipeline

The quickest way to get set up is to sign up for AWS Data Pipeline and use the AWS Data Pipeline console and the templates it provides to create pipelines. However, there are multiple ways to interact with AWS Data Pipeline:

  • Console — a graphical interface you can use to create and manage pipelines. With it, you fill out web forms to specify the configuration details of your pipeline components. The AWS Data Pipeline console provides several templates, which are pre-configured pipelines for common scenarios. As you build your pipeline, graphical representations of the components appear on the design pane, and the arrows between the components indicate how they are connected. Using the console is the easiest way to get started with AWS Data Pipeline. It creates the pipeline definition for you, and no JSON or programming knowledge is required. For more information about accessing the console, see Accessing the Console.

  • Command Line Interface (CLI) — an application you run on your local machine to connect to AWS Data Pipeline and create and manage pipelines. With it, you enter commands in a terminal window and pass in JSON files that specify the pipeline definition. Using the CLI is the best option if you prefer working from a command line. For more information, see (Optional) Install the Command Line Interface.

  • Software Development Kit (SDK) — AWS provides an SDK with functions that call AWS Data Pipeline to create and manage pipelines. With it, you can write applications that automate the process of creating and managing pipelines. Using the SDK is the best option if you want to extend or customize the functionality of AWS Data Pipeline. For more information, see Working with the API.

  • Web Service API — AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls AWS Data Pipeline. For more information, see Making an HTTP Request to AWS Data Pipeline.
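Several of the options above work with a pipeline definition expressed as JSON. The sketch below builds one for the daily log-archiving example as a plain Python dictionary and prints it as JSON. Treat it as an illustration of the general shape only: the object ids, the bucket path, and the use of an Amazon S3 location as the copy source are assumptions made for this example, and the definition is deliberately incomplete (for instance, it omits the compute resource the activity would run on). Check the field names against the pipeline definition reference before using them.

```python
import json


def build_definition(bucket_path):
    """Build a minimal pipeline definition (hypothetical ids and paths)."""
    return {
        "objects": [
            # When the work runs: once per day from the start date.
            {"id": "DailySchedule", "type": "Schedule",
             "startDateTime": "2013-01-01T00:00:00", "period": "1 day"},
            # Where the logs come from (assumed to be an S3 location here).
            {"id": "WebServerLogs", "type": "S3DataNode",
             "schedule": {"ref": "DailySchedule"},
             "directoryPath": bucket_path + "/incoming"},
            # Where the archived logs are written.
            {"id": "LogArchive", "type": "S3DataNode",
             "schedule": {"ref": "DailySchedule"},
             "directoryPath": bucket_path + "/archive"},
            # The task itself, linked to its schedule, input, and output.
            {"id": "ArchiveLogs", "type": "CopyActivity",
             "schedule": {"ref": "DailySchedule"},
             "input": {"ref": "WebServerLogs"},
             "output": {"ref": "LogArchive"}},
        ]
    }


def unresolved_refs(definition):
    """Return every referenced id that no object in the definition declares."""
    ids = {obj["id"] for obj in definition["objects"]}
    refs = {
        value["ref"]
        for obj in definition["objects"]
        for value in obj.values()
        if isinstance(value, dict) and "ref" in value
    }
    return refs - ids


definition = build_definition("s3://example-bucket/logs")
assert not unresolved_refs(definition)  # every reference points at a real id
print(json.dumps(definition, indent=2))
```

Building the definition in code and checking its references locally is one way to catch a mistyped id before the definition is submitted. The console generates an equivalent definition for you, so this step only matters when you write the JSON yourself.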

In addition, there is the Task Runner application, which is a default implementation of a task runner. Depending on the requirements of your data management, you may need to install Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. For more information, see Task Runners.