Apache Pig

Amazon EMR supports Apache Pig, a programming framework you can use to analyze and transform large data sets. For more information about Pig, go to http://pig.apache.org/.

Pig is an open-source Apache library that runs on top of Hadoop. The library takes SQL-like commands written in a language called Pig Latin and converts them into either Tez jobs, which are based on directed acyclic graphs (DAGs), or MapReduce programs. As a result, you do not have to write complex code in a lower-level language, such as Java.

You can execute Pig commands interactively or in batch mode. To use Pig interactively, create an SSH connection to the master node and submit commands using the Grunt shell. To use Pig in batch mode, write your Pig scripts, upload them to Amazon S3, and submit them as cluster steps. For more information on submitting work to a cluster, see Submit Work to a Cluster in the Amazon EMR Management Guide.
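The batch-mode path described above can be sketched in Python as a small helper that builds a step definition for a Pig script stored in Amazon S3. This is only a sketch under stated assumptions: the function name `build_pig_step`, the bucket, and the script path are hypothetical, and the `pig-script --run-pig-script --args` argument layout for `command-runner.jar` is an assumption that should be checked against the Amazon EMR Management Guide for your release.

```python
def build_pig_step(script_uri, name="Pig script", params=None,
                   action_on_failure="CONTINUE"):
    """Return a step definition that runs a Pig script stored in Amazon S3.

    Assumption: the Args layout below matches what command-runner.jar
    expects for Pig steps on this release; verify before relying on it.
    """
    if not script_uri.startswith("s3://"):
        raise ValueError("script_uri must be an s3:// URI")

    # -f names the script file; each -p passes one Pig parameter.
    args = ["pig-script", "--run-pig-script", "--args", "-f", script_uri]
    for key, value in (params or {}).items():
        args += ["-p", f"{key}={value}"]

    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical bucket and script names, for illustration only.
step = build_pig_step(
    "s3://my-bucket/scripts/report.pig",
    params={"INPUT": "s3://my-bucket/input/"},
)
```

A dictionary of this shape is what the boto3 EMR client's `add_job_flow_steps(JobFlowId=..., Steps=[step])` call accepts. The helper itself makes no network calls, so the step can be inspected before it is submitted.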

Note

When you use an Apache Pig script to write output to an HCatalog table in Amazon S3, disable Amazon EMR direct write within the script by using the SET mapred.output.direct.NativeS3FileSystem false and SET mapred.output.direct.EmrFileSystem false commands. For more information, see Using HCatalog.
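One way to apply the note consistently is to prepend the two SET statements to a script before uploading it. The following Python sketch assumes the statements go at the top of the script and end with semicolons. The helper name `disable_direct_write` and the sample script's bucket, table, and field names are hypothetical.

```python
# The two settings named in the note, written as Pig statements.
DIRECT_WRITE_OFF = (
    "SET mapred.output.direct.NativeS3FileSystem false;",
    "SET mapred.output.direct.EmrFileSystem false;",
)


def disable_direct_write(script_text):
    """Prepend any missing SET statement to the top of a Pig script."""
    missing = [line for line in DIRECT_WRITE_OFF if line not in script_text]
    if not missing:
        return script_text  # both statements already present
    return "\n".join(missing + [script_text])


# Hypothetical script that stores its output in an HCatalog table.
script = (
    "data = LOAD 's3://my-bucket/input/' USING PigStorage(',') "
    "AS (id:int, name:chararray);\n"
    "STORE data INTO 'my_table' "
    "USING org.apache.hive.hcatalog.pig.HCatStorer();\n"
)

patched = disable_direct_write(script)
```

The check is a plain, case-sensitive substring match. A script that already sets these properties with different spelling or spacing would receive a second copy of the statements.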

Pig Release Information for This Release of Amazon EMR

Application: Pig 0.17.0

Amazon EMR Release Label: emr-5.10.0

Components installed with this application: emrfs, emr-ddb, emr-goodies, emr-kinesis, emr-s3-dist-cp, hadoop-client, hadoop-mapred, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, pig-client, tez-on-yarn