Submit Pig Work
This section demonstrates submitting Pig work to an Amazon EMR cluster. The examples that follow generate a report containing the total bytes transferred, a list of the top 50 IP addresses, a list of the top 50 external referrers, and the top 50 search terms using Bing and Google. The Pig script is located in Amazon S3 at s3://elasticmapreduce/samples/pig-apache/do-reports2.pig. Input data is located in Amazon S3 at s3://elasticmapreduce/samples/pig-apache/input. The output is saved to an Amazon S3 bucket.
For EMR 3.x or earlier versions, you must copy and modify the Pig script do-reports.pig to make it work. In your modified script, replace the following line
register file:/home/hadoop/lib/pig/piggybank.jar
with this:
register file:/usr/lib/pig/lib/piggybank.jar
Then upload the modified script to your own bucket in Amazon S3.
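The replacement described above can also be scripted. The following is a minimal sketch, assuming you already have a local copy of the script. The helper name fix_piggybank_path, the local file names, and the bucket name mybucket are placeholders chosen for illustration and are not part of the original instructions.

```shell
#!/bin/sh
# Rewrite the piggybank register line in a Pig script read from stdin.
# fix_piggybank_path is a hypothetical helper name used only for illustration.
# Lines that do not start with the old register statement pass through unchanged.
fix_piggybank_path() {
    sed 's|^register file:/home/hadoop/lib/pig/piggybank.jar|register file:/usr/lib/pig/lib/piggybank.jar|'
}

# Example usage (file names and mybucket are placeholders; the upload
# requires the AWS CLI and valid credentials):
#   fix_piggybank_path < do-reports.pig > do-reports-modified.pig
#   aws s3 cp do-reports-modified.pig s3://mybucket/pig-apache/do-reports-modified.pig
```

Because sed rewrites only the matched text, anything that follows the path on the same line, such as a trailing semicolon, is preserved.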
Submit Pig Work Using the Amazon EMR Console
This example describes how to use the Amazon EMR console to add a Pig step to a cluster.
To submit a Pig step
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- In the Cluster List, select the name of your cluster.
- Scroll to the Steps section and expand it, then choose Add step.
- In the Add Step dialog:
  - For Step type, choose Pig program.
  - For Name, accept the default name (Pig program) or type a new name.
  - For Script S3 location, type the location of the Pig script. For example: s3://elasticmapreduce/samples/pig-apache/do-reports2.pig.
  - For Input S3 location, type the location of the input data. For example: s3://elasticmapreduce/samples/pig-apache/input.
  - For Output S3 location, type or browse to the name of your Amazon S3 output bucket.
  - For Arguments, leave the field blank.
  - For Action on failure, accept the default option (Continue).
- Choose Add. The step appears in the console with a status of Pending.
- The status of the step changes from Pending to Running to Completed as the step runs. To update the status, choose the Refresh icon above the Actions column.
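The status progression described in the last step can also be checked from a terminal. The following is a rough sketch, assuming the AWS CLI is installed and configured. It uses the aws emr describe-step subcommand. The helper names step_state and wait_for_step, the POLL_SECONDS variable, and the cluster and step IDs are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Print the current state of a step, for example PENDING, RUNNING, or COMPLETED.
# step_state is a hypothetical helper; $1 is the cluster ID, $2 is the step ID.
step_state() {
    aws emr describe-step --cluster-id "$1" --step-id "$2" \
        --query 'Step.Status.State' --output text
}

# Poll until the step is no longer pending or running, then print the final state.
# POLL_SECONDS is an illustrative variable; it defaults to 30 seconds.
wait_for_step() {
    while :; do
        state=$(step_state "$1" "$2") || return 1
        case "$state" in
            PENDING|RUNNING|CANCEL_PENDING) sleep "${POLL_SECONDS:-30}" ;;
            *) echo "$state"; return 0 ;;
        esac
    done
}

# Example usage (both IDs are placeholders):
#   wait_for_step j-XXXXXXXXXXXXX s-XXXXXXXXXXXXX
```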
Submit Pig Work Using the AWS CLI
To submit a Pig step using the AWS CLI
When you launch a cluster using the AWS CLI, use the --applications parameter to install Pig. To submit a Pig step, use the --steps parameter.
To launch a cluster with Pig installed and to submit a Pig step, type the following command, replacing myKey with the name of your EC2 key pair and mybucket with the name of your Amazon S3 bucket.
aws emr create-cluster --name "Test cluster" --release-label emr-5.32.0 --applications Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 \
--steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/pig-apache/do-reports2.pig,-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3://mybucket/pig-apache/output]
Note: Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).
When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.
Note: If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.
For more information about using Amazon EMR commands in the AWS CLI, see https://docs.aws.amazon.com/cli/latest/reference/emr.
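The --steps value has the same shape whether the step is submitted when the cluster is created or added to a cluster that is already running. The following is a small sketch that assembles that value from its three locations. The helper name pig_step_arg and the cluster ID are hypothetical. The aws emr add-steps subcommand shown in the usage comment is an alternative to create-cluster that this section does not otherwise cover.

```shell
#!/bin/sh
# Build the --steps value for a Pig step.
# pig_step_arg is a hypothetical helper: $1 is the script location,
# $2 is the input location, and $3 is the output location.
# The step name contains no literal quotes here because the shell strips the
# quotes from Name="Pig Program" before the AWS CLI receives the argument.
pig_step_arg() {
    printf 'Type=PIG,Name=Pig Program,ActionOnFailure=CONTINUE,Args=[-f,%s,-p,INPUT=%s,-p,OUTPUT=%s]\n' "$1" "$2" "$3"
}

# Example usage (the cluster ID and mybucket are placeholders; requires the
# AWS CLI and valid credentials):
#   aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps "$(pig_step_arg \
#       s3://elasticmapreduce/samples/pig-apache/do-reports2.pig \
#       s3://elasticmapreduce/samples/pig-apache/input \
#       s3://mybucket/pig-apache/output)"
```

Quoting the command substitution keeps the whole step definition, including the space in the step name, as a single argument.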