Pig application specifics for earlier AMI versions of Amazon EMR

Supported Pig versions

The Pig version you can add to your cluster depends on the version of the Amazon EMR AMI and the version of Hadoop you are using. The following table shows the AMI versions that are compatible with each version of Pig. We recommend using the latest available version of Pig to take advantage of performance enhancements and new functionality.

When you use the API to install Pig, the default version is used unless you specify --pig-versions as an argument to the step that loads Pig onto the cluster during the call to RunJobFlow.
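As a sketch, pinning the Pig version at cluster creation with the legacy Amazon EMR CLI might look like the following. The cluster name is a placeholder and the exact CLI syntax varied across CLI releases, so treat this as illustrative rather than a definitive invocation:

```shell
# Illustrative only: pin the Pig version when creating a cluster with the
# legacy Amazon EMR CLI. The cluster name is a placeholder; the version
# flags follow the --pig-versions / --ami-version parameters described above.
CREATE_ARGS="--create --alive --name PigCluster --ami-version 2.2 --pig-interactive --pig-versions 0.11.1.1"

# Echo the command rather than running it, since a real call needs AWS credentials.
echo "elastic-mapreduce $CREATE_ARGS"
```

If `--pig-versions` is omitted, the default Pig version for the AMI is installed instead.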

0.12.0 (Release notes | Documentation)

AMI version: 3.1.0 and later

Configuration parameters: --ami-version 3.1, --ami-version 3.2, --ami-version 3.3

Pig version details: Adds support for the following:

  • Streaming UDFs without JVM implementations

  • ASSERT and IN operators

  • CASE expression

  • AvroStorage as a Pig built-in function

  • ParquetLoader and ParquetStorer as built-in functions

  • BigInteger and BigDecimal types

0.11.1.1 (Release notes | Documentation)

AMI version: 2.2 and later

Configuration parameters: --pig-versions 0.11.1.1, --ami-version 2.2

Pig version details: Improves the performance of the LOAD command with PigStorage if the input resides in Amazon S3.

0.11.1 (Release notes | Documentation)

AMI version: 2.2 and later

Configuration parameters: --pig-versions 0.11.1, --ami-version 2.2

Pig version details: Adds support for JDK 7, Hadoop 2, Groovy user-defined functions, SchemaTuple optimization, new operators, and more. For more information, see the Pig 0.11.1 change log.

0.9.2.2 (Release notes | Documentation)

AMI version: 2.2 and later

Configuration parameters: --pig-versions 0.9.2.2, --ami-version 2.2

Pig version details: Adds support for Hadoop 1.0.3.

0.9.2.1 (Release notes | Documentation)

AMI version: 2.2 and later

Configuration parameters: --pig-versions 0.9.2.1, --ami-version 2.2

Pig version details: Adds support for MapR.

0.9.2 (Release notes | Documentation)

AMI version: 2.2 and later

Configuration parameters: --pig-versions 0.9.2, --ami-version 2.2

Pig version details: Includes several performance improvements and bug fixes. For complete information about the changes for Pig 0.9.2, go to the Pig 0.9.2 change log.

0.9.1 (Release notes | Documentation)

AMI version: 2.0

Configuration parameters: --pig-versions 0.9.1, --ami-version 2.0

0.6 (Release notes)

AMI version: 1.0

Configuration parameters: --pig-versions 0.6, --ami-version 1.0

0.3 (Release notes)

AMI version: 1.0

Configuration parameters: --pig-versions 0.3, --ami-version 1.0

Pig version details

Amazon EMR supports certain Pig releases that might have additional Amazon EMR patches applied. You can configure which version of Pig to run on Amazon EMR clusters. For more information about how to do this, see Apache Pig. The following sections describe different Pig versions and the patches applied to the versions loaded on Amazon EMR.

Pig patches

This section describes the custom patches applied to Pig versions available with Amazon EMR.

Pig 0.11.1.1 patches

The Amazon EMR version of Pig 0.11.1.1 is a maintenance release that improves the performance of the LOAD command with PigStorage if the input resides in Amazon S3.

Pig 0.11.1 patches

The Amazon EMR version of Pig 0.11.1 contains all the updates provided by the Apache Software Foundation and the cumulative Amazon EMR patches from Pig version 0.9.2.2. However, there are no new Amazon EMR-specific patches in Pig 0.11.1.

Pig 0.9.2 patches

Apache Pig 0.9.2 is a maintenance release of Pig. The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.2.

PIG-1429

Add the Boolean data type to Pig as a first class data type. For more information, go to https://issues.apache.org/jira/browse/PIG-1429.

Status: Committed

Fixed in Apache Pig Version: 0.10

PIG-1824

Support import modules in Jython UDF. For more information, go to https://issues.apache.org/jira/browse/PIG-1824.

Status: Committed

Fixed in Apache Pig Version: 0.10

PIG-2010

Bundle registered JARs on the distributed cache. For more information, go to https://issues.apache.org/jira/browse/PIG-2010.

Status: Committed

Fixed in Apache Pig Version: 0.11

PIG-2456

Add a ~/.pigbootup file where the user can specify default Pig statements. For more information, go to https://issues.apache.org/jira/browse/PIG-2456.

Status: Committed

Fixed in Apache Pig Version: 0.11

PIG-2623

Support using Amazon S3 paths to register UDFs. For more information, go to https://issues.apache.org/jira/browse/PIG-2623.

Status: Committed

Fixed in Apache Pig Version: 0.10, 0.11
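The ~/.pigbootup file added by PIG-2456 is a plain-text file of Pig statements that run automatically when the Grunt shell starts. A minimal sketch follows; the file is written to /tmp here to keep the example side-effect free, and the SET statement is only an illustration of the kind of default you might put there:

```shell
# Sketch of a ~/.pigbootup file (PIG-2456). Pig runs the statements in this
# file when the Grunt shell starts. Written to /tmp here so the example does
# not touch the home directory; the statements themselves are illustrative.
BOOTUP=/tmp/pigbootup.example
cat > "$BOOTUP" <<'EOF'
-- Run at Grunt startup: raise the default reducer parallelism
SET default_parallel 10;
EOF
cat "$BOOTUP"
```

In a real session you would place this content in ~/.pigbootup on the master node.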

Pig 0.9.1 patches

The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.1.

Support JAR files and Pig scripts in dfs

Add support for running scripts and registering JAR files stored in HDFS, Amazon S3, or other distributed file systems. For more information, go to https://issues.apache.org/jira/browse/PIG-1505.

Status: Committed

Fixed in Apache Pig Version: 0.8.0

Support multiple file systems in Pig

Add support for Pig scripts to read data from one file system and write it to another. For more information, go to https://issues.apache.org/jira/browse/PIG-1564.

Status: Not Committed

Fixed in Apache Pig Version: n/a

Add Piggybank datetime and string UDFs

Add datetime and string UDFs to support custom Pig scripts. For more information, go to https://issues.apache.org/jira/browse/PIG-1565.

Status: Not Committed

Fixed in Apache Pig Version: n/a

Interactive and batch Pig clusters

Amazon EMR enables you to run Pig scripts in two modes:

  • Interactive

  • Batch

When you launch a long-running cluster using the console or the AWS CLI, you can use SSH to connect to the master node as the hadoop user and use the Grunt shell to develop and run your Pig scripts interactively. Working with Pig interactively lets you revise a script more easily than batch mode does. After you successfully revise a Pig script in interactive mode, you can upload it to Amazon S3 and use batch mode to run it in production. You can also submit Pig commands interactively on a running cluster to analyze and transform data as needed.
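A sketch of the start of such an interactive session is shown below. The key file path and master node DNS name are placeholders for your own values, and the commands are echoed rather than executed because they require a live cluster:

```shell
# Illustrative only: connect to the master node and start the Grunt shell.
# The key path and public DNS name below are placeholders.
MASTER_DNS="ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
KEY_FILE="$HOME/mykeypair.pem"

# Echo instead of executing, since these commands need a running cluster.
echo "ssh -i $KEY_FILE hadoop@$MASTER_DNS"
echo "pig    # starts the Grunt shell (grunt> prompt) on the master node"
```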

In batch mode, you upload your Pig script to Amazon S3 and then submit the work to the cluster as a step. You can submit Pig steps to either a long-running cluster or a transient cluster.
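As a sketch, submitting a Pig script stored in Amazon S3 as a step with the legacy Amazon EMR CLI might look like the following. The bucket, script name, and job flow ID are placeholders, and the command is echoed rather than run since a real call needs credentials and an existing cluster:

```shell
# Illustrative only: run a Pig script from Amazon S3 as a step on an existing
# cluster with the legacy Amazon EMR CLI. Bucket, script, and job flow ID
# are placeholders.
SCRIPT_URI="s3://mybucket/scripts/myscript.pig"
STEP_ARGS="--pig-script --args $SCRIPT_URI"

# Echo instead of running; substitute your own job flow ID in a real call.
echo "elastic-mapreduce --jobflow j-XXXXXXXXXXXXX $STEP_ARGS"
```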