Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Supported Pig Versions

The versions of Pig you can run depend on the version of the Amazon Elastic MapReduce (Amazon EMR) AMI and the version of Hadoop you are using. The table below shows which AMI versions and versions of Hadoop are compatible with the different versions of Pig. We recommend using the latest available version of Pig to take advantage of performance enhancements and new functionality. To select the configuration, use the --ami-version, --hadoop-version, and --pig-versions parameters in the cluster creation call.

The default configuration for Amazon EMR clusters launched with AMI version 2.2 and later is Hadoop 1.0.3 with Pig 0.9.2.1. The default configuration for Amazon EMR clusters launched with AMI version 1.0 is Hadoop 0.18 with Pig 0.3. For more information about the Amazon EMR AMIs and AMI versioning, see Choose a Machine Image.

The Amazon EMR console does not support Pig versioning and always launches the latest version of Pig.

The version of the Amazon EMR CLI released on 9 April 2012 is the first version to support Pig versioning. Clusters created with versions of the CLI downloaded before 9 April 2012 do not support Pig versioning and use the default configuration of Pig. Clusters created with versions of the Amazon EMR CLI downloaded on 9 April 2012 or later will use the latest version of Pig available on the AMI, unless otherwise specified using the --pig-versions parameter. You can download the latest version of the CLI from http://aws.amazon.com/code/Elastic-MapReduce/2264.

Calls to the API will launch the default configuration of Pig unless you specify --pig-versions as an argument to the step that loads Pig onto the cluster during the call to RunJobFlow.

Pig Version    Hadoop Version    AMI Version      Configuration Parameters
0.3            0.18              1.0              --pig-versions 0.3 --hadoop-version 0.18 --ami-version 1.0
0.6            0.20              1.0              --pig-versions 0.6 --hadoop-version 0.20 --ami-version 1.0
0.9.1          0.20.205          2.0              --pig-versions 0.9.1 --hadoop-version 0.20.205 --ami-version 2.0
0.9.2          1.0.3             2.2 and later    --pig-versions 0.9.2 --hadoop-version 1.0.3 --ami-version 2.2
0.9.2.1        1.0.3             2.2 and later    --pig-versions 0.9.2.1 --hadoop-version 1.0.3 --ami-version 2.2
0.9.2.2        1.0.3             2.2 and later    --pig-versions 0.9.2.2 --hadoop-version 1.0.3 --ami-version 2.2
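The compatibility table above can also be expressed as a small lookup, which may help when scripting cluster creation. The following is a minimal sketch only: pig_config_flags is a hypothetical helper name, not part of the Amazon EMR CLI, and for rows listed as "2.2 and later" it emits the --ami-version 2.2 value shown in the Configuration Parameters column.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Amazon EMR CLI): print the flag
# combination from the compatibility table for a given Pig version.
pig_config_flags() {
  case "$1" in
    0.3)   echo "--pig-versions 0.3 --hadoop-version 0.18 --ami-version 1.0" ;;
    0.6)   echo "--pig-versions 0.6 --hadoop-version 0.20 --ami-version 1.0" ;;
    0.9.1) echo "--pig-versions 0.9.1 --hadoop-version 0.20.205 --ami-version 2.0" ;;
    0.9.2|0.9.2.1|0.9.2.2)
           # These three rows share the same Hadoop and AMI versions.
           echo "--pig-versions $1 --hadoop-version 1.0.3 --ami-version 2.2" ;;
    *)     echo "unsupported Pig version: $1" >&2
           return 1 ;;
  esac
}

# Print the flags for Pig 0.9.2.1.
pig_config_flags 0.9.2.1
```

The printed flags could then be appended to an elastic-mapreduce --create command such as those in the procedures that follow.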

To specify the Pig version when creating the cluster

  • Use the --pig-versions parameter. The following command-line example creates an interactive Pig cluster running Hadoop 1.0.3 and Pig 0.9.2. Replace instanceType with an EC2 instance type such as m1.small.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Pig" \
      --hadoop-version 1.0.3 \
      --ami-version 2.2 \
      --num-instances 5 --instance-type instanceType \
      --pig-interactive \
      --pig-versions 0.9.2

    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Pig" --hadoop-version 1.0.3 --ami-version 2.2 --num-instances 5 --instance-type instanceType --pig-interactive --pig-versions 0.9.2

To specify the latest Pig version when creating the cluster

  • Use the --pig-versions parameter with the latest keyword. The following command-line example creates an interactive Pig cluster running the latest version of Pig. Replace instanceType with an EC2 instance type such as m1.small.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Latest Pig" \
      --hadoop-version 1.0.3 \
      --ami-version 2.2 \
      --num-instances 5 --instance-type instanceType \
      --pig-interactive \
      --pig-versions latest

    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Latest Pig" --hadoop-version 1.0.3 --ami-version 2.2 --num-instances 5 --instance-type instanceType --pig-interactive --pig-versions latest

To load multiple versions of Pig for a given cluster

  • Use the --pig-versions parameter and separate the version numbers with commas. The following command-line example creates an interactive Pig cluster running Hadoop 0.20.205 with both Pig 0.9.1 and Pig 0.9.2. With this configuration, you can use either version of Pig on the cluster. Replace instanceType with an EC2 instance type such as m1.small.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Pig" \
      --hadoop-version 0.20.205 \
      --ami-version 2.0 \
      --num-instances 5 --instance-type instanceType \
      --pig-interactive \
      --pig-versions 0.9.1,0.9.2

    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Pig" --hadoop-version 0.20.205 --ami-version 2.0 --num-instances 5 --instance-type instanceType --pig-interactive --pig-versions 0.9.1,0.9.2
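If the list of versions comes from a script rather than being typed by hand, the comma-separated value can be assembled with a short shell function. This is a sketch under stated assumptions: join_pig_versions is a hypothetical helper, not part of the Amazon EMR CLI, and it does not check that the versions are compatible with the chosen AMI and Hadoop version.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Amazon EMR CLI): join the arguments
# with commas to form the value passed to --pig-versions.
join_pig_versions() {
  # "$*" joins the arguments using the first character of IFS;
  # the subshell keeps the IFS change from leaking to the caller.
  ( IFS=,; printf '%s\n' "$*" )
}

versions=$(join_pig_versions 0.9.1 0.9.2)
echo "--pig-versions $versions"
```

Run as written, this prints --pig-versions 0.9.1,0.9.2, which matches the final parameter in the example commands above.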

If multiple versions of Pig are loaded on a cluster, calling pig runs the default version of Pig (currently 0.9.2) or, if the cluster creation call specifies multiple --pig-versions parameters, the version loaded last. When you load multiple versions with the comma-separated --pig-versions syntax, pig runs the default version of Pig.

To call a specific version of Pig

  • Add the version number to the call, for example pig-0.9.1 or pig-0.9.2. In an interactive Pig cluster, for example, use SSH to connect to the master node and then run a command like the following from the terminal.

    pig-0.9.1
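Before calling a versioned launcher, it can be useful to confirm which ones are actually present on the master node. The following is a minimal sketch, assuming the launchers follow the pig-<version> naming shown above and are on the PATH; check_pig_launchers is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: report which versioned Pig launchers are on the PATH.
# Assumes the pig-<version> naming used in the example above.
check_pig_launchers() {
  for v in "$@"; do
    if command -v "pig-$v" >/dev/null 2>&1; then
      echo "pig-$v: available"
    else
      echo "pig-$v: not found"
    fi
  done
}

# Check the two versions loaded in the multiple-version example.
check_pig_launchers 0.9.1 0.9.2
```

The function only looks up the command names; it does not start Pig or connect to the cluster.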

Pig Version Details

You can configure which version of Pig to run on Amazon Elastic MapReduce (Amazon EMR) clusters. For more information about how to do this, see Process Data with Pig. The following sections describe different Pig versions and the patches applied to the versions loaded on Amazon EMR.

New Features of Pig 0.9.2

Pig 0.9.2.2 adds support for Hadoop 1.0.3.

Pig 0.9.2.1 adds support for MapR. For more information, see Using the MapR Distribution for Hadoop.

Pig 0.9.2 includes several performance improvements and bug fixes. For complete information about the changes for Pig 0.9.2, go to the Pig 0.9.2 Change Log.

Pig 0.9.2 Patches

Apache Pig 0.9.2 is a maintenance release of Pig. The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.2.

Patch       Description
PIG-1429    Add the Boolean data type to Pig as a first class data type. For more information, go to https://issues.apache.org/jira/browse/PIG-1429.
            Status: Committed
            Fixed in Apache Pig Version: 0.10

PIG-1824    Support import modules in Jython UDF. For more information, go to https://issues.apache.org/jira/browse/PIG-1824.
            Status: Committed
            Fixed in Apache Pig Version: 0.10

PIG-2010    Bundle registered JARs on the distributed cache. For more information, go to https://issues.apache.org/jira/browse/PIG-2010.
            Status: Committed
            Fixed in Apache Pig Version: 0.11

PIG-2456    Add a ~/.pigbootup file where the user can specify default Pig statements. For more information, go to https://issues.apache.org/jira/browse/PIG-2456.
            Status: Committed
            Fixed in Apache Pig Version: 0.11

PIG-2623    Support using Amazon S3 paths to register UDFs. For more information, go to https://issues.apache.org/jira/browse/PIG-2623.
            Status: Committed
            Fixed in Apache Pig Version: 0.10, 0.11

Pig 0.9.1 Patches

The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.1.

Patch                                       Description
Support JAR files and Pig scripts in dfs    Add support for running scripts and registering JAR files stored in HDFS, Amazon S3, or other distributed file systems. For more information, go to https://issues.apache.org/jira/browse/PIG-1505.
                                            Status: Committed
                                            Fixed in Apache Pig Version: 0.8.0

Support multiple file systems in Pig        Add support for Pig scripts to read data from one file system and write it to another. For more information, go to https://issues.apache.org/jira/browse/PIG-1564.
                                            Status: Not Committed
                                            Fixed in Apache Pig Version: n/a

Add Piggybank datetime and string UDFs      Add datetime and string UDFs to support custom Pig scripts. For more information, go to https://issues.apache.org/jira/browse/PIG-1565.
                                            Status: Not Committed
                                            Fixed in Apache Pig Version: n/a

Additional Pig Functions

The Amazon EMR development team has created additional Pig functions that simplify string manipulation and make it easier to format date-time information. These are available at http://aws.amazon.com/code/2730.