
Amazon EMR Serverless is in preview release and is subject to change. To use EMR Serverless in preview, follow the sign up steps at https://pages.awscloud.com/EMR-Serverless-Preview.html. The only Region that EMR Serverless currently supports is us-east-1, so make sure to set all region parameters to this value. All Amazon S3 buckets used with EMR Serverless must also be created in us-east-1.

Logging

When you submit job runs to your EMR Serverless application, you can enable logging. This section describes how to configure logging during job submission, and how to use Docker to launch the Spark history server and the application timeline server locally so that you can view your logs in the Spark UI or Tez UI after your job has completed.

Prerequisites

  1. You must have the docker command installed locally. For information about how to install Docker, see the Docker Engine Community documentation.

  2. You must have access to an S3 bucket to store your logs. To set up an S3 bucket, use the s3MonitoringConfiguration configuration when starting a job run. You can do this by providing the following --configuration-overrides configuration (a complete job submission sketch follows the log folder lists below).

    {
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/"
            }
        }
    }

    After you provision your application and run jobs, your logs can be found in s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/applications/application_id/jobs/job_id. From there, you'll find driver, executor, and event logs in specific folders.

    For Spark, you can find logs in the following folders:

    • Driver logs ‐ /SPARK_DRIVER

    • Executor logs ‐ /SPARK_EXECUTOR

    • Spark event logs ‐ /sparklogs

    For Hive, you can find logs in the following folders:

    • Driver logs ‐ /HIVE_DRIVER

    • Executor logs ‐ /TEZ_TASK

    • Event logs ‐ /timeline-data
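
    For example, a job run submission with S3 logging enabled might look like the following AWS CLI command. This is only a sketch: the application ID, execution role ARN, and Spark entry point are placeholder values, and the preview may require additional CLI setup before the emr-serverless commands are available.

    aws emr-serverless start-job-run \
        --application-id application_id \
        --execution-role-arn arn:aws:iam::111122223333:role/EMRServerlessJobRole \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://DOC-EXAMPLE-BUCKET/scripts/wordcount.py"
            }
        }' \
        --configuration-overrides '{
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": "s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/"
                }
            }
        }'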

Start the Spark history server and view the Spark UI locally using Docker

  1. Save both pom.xml and Dockerfile locally.

    In the pom.xml file, update the AWS SDK version to match your Amazon EMR release version. Your pom.xml file might look like the following example.

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.amazonaws</groupId>
      <artifactId>EMRServerlessSparkHistoryServer</artifactId>
      <packaging>jar</packaging>
      <version>2.0-SNAPSHOT</version>
      <name>EMRServerlessSparkHistoryServer</name>
      <url>http://maven.apache.org</url>
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <jdk.version>1.8</jdk.version>
        <hadoop.version>3.2.1</hadoop.version>
        <awssdk.version>1.11.977</awssdk.version>
        <httpclient.version>4.5.13</httpclient.version>
        <jackson.version>2.10.5</jackson.version>
        <jackson.databind.version>2.10.5.1</jackson.databind.version>
      </properties>
      <dependencyManagement>
        <dependencies>
          <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk-bom</artifactId>
            <version>${awssdk.version}</version>
            <type>pom</type>
            <scope>import</scope>
          </dependency>
        </dependencies>
      </dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>com.amazonaws</groupId>
          <artifactId>aws-java-sdk-s3</artifactId>
        </dependency>
      </dependencies>
    </project>

    In the Dockerfile, update the second FROM line with the Amazon EMR release and Region that you are using. To choose your base image URI, see How to select a base image URI. The following Dockerfile is a sample that you should modify to meet your requirements.

    FROM maven:3.6-amazoncorretto-8
    FROM 755674844232.dkr.ecr.us-east-1.amazonaws.com/spark/emr-6.3.0
    USER root
    WORKDIR /tmp/
    ADD pom.xml /tmp
    COPY --from=0 /usr/share/maven /usr/share/maven
    RUN /usr/share/maven/bin/mvn dependency:tree
    RUN /usr/share/maven/bin/mvn dependency:copy-dependencies -DoutputDirectory=/usr/lib/spark/jars/
    RUN mkdir /mnt/s3 \
        && chown spark:spark /mnt/s3
    USER spark:spark
    ENV SPARK_NO_DAEMONIZE=true
    ENTRYPOINT [ "/usr/lib/spark/sbin/start-history-server.sh" ]
  2. Build the Docker image from the files in the local directory, and name it emr/sparkui.

    docker build -t emr/sparkui .
  3. Define your Amazon S3 log location as an environment variable. The path includes the application ID and job run ID from your job submission; if you need to look these up, see the sketch that follows these steps.

    LOG_DIR="s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/applications/application_id/jobs/job_id/sparklogs/"
  4. Define AWS access credentials as environment variables and define the Region you’re running your job in.

    export AWS_ACCESS_KEY_ID=AKIAAAABBBCCCDDD
    export AWS_SECRET_ACCESS_KEY=abcd1234
    export AWS_REGION=us-east-1
  5. Create and start the Docker container. In the following command, use the values that you defined in the previous steps.

    docker run --rm \
        --user spark \
        -p 18080:18080 \
        -e SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=$LOG_DIR -Dspark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
        -e AWS_REGION -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
        emr/sparkui
  6. Open http://localhost:18080 in your browser to view the Spark UI locally.
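
If you need to look up the application ID and job run ID used in the log path (step 3), you can typically list them with the EMR Serverless CLI, as in the following sketch. This assumes the emr-serverless commands are available in your AWS CLI.

    # List applications to find the application ID.
    aws emr-serverless list-applications

    # List job runs for that application to find the job run ID.
    aws emr-serverless list-job-runs --application-id application_id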

Start the application timeline server and view the Tez UI locally using Docker

  1. Save pom.xml, Dockerfile, entrypoint.sh, event-log-sync.sh, hadoop-layout.sh, and yarn-site.xml locally.

    Your pom.xml file might look like the following example.

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.amazonaws</groupId>
      <artifactId>TezUI</artifactId>
      <packaging>jar</packaging>
      <version>2.0-SNAPSHOT</version>
      <name>TezUI</name>
      <url>http://maven.apache.org</url>
      <dependencies>
        <dependency>
          <groupId>joda-time</groupId>
          <artifactId>joda-time</artifactId>
          <version>2.9.3</version>
          <exclusions>
            <exclusion>
              <groupId>*</groupId>
              <artifactId>*</artifactId>
            </exclusion>
          </exclusions>
        </dependency>
        <dependency>
          <groupId>org.apache.tez</groupId>
          <artifactId>tez-yarn-timeline-cache-plugin</artifactId>
          <version>0.10.0</version>
          <exclusions>
            <exclusion>
              <groupId>*</groupId>
              <artifactId>*</artifactId>
            </exclusion>
          </exclusions>
        </dependency>
        <dependency>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>jetty-runner</artifactId>
          <version>9.3.27.v20190418</version>
          <exclusions>
            <exclusion>
              <groupId>*</groupId>
              <artifactId>*</artifactId>
            </exclusion>
          </exclusions>
        </dependency>
      </dependencies>
    </project>

    The following Dockerfile is a sample that you should modify to meet your requirements. Make sure that ports 8088 (Hadoop Resource Manager), 8188 (Hadoop Application Timeline Server), and 9999 (Tez UI) are not already in use.

    FROM amazonlinux:2
    FROM amazoncorretto:8
    FROM maven:3.6-amazoncorretto-8

    RUN yum install -y procps awscli rsync
    WORKDIR /tmp/

    ENV ENTRYPOINT /usr/bin/entrypoint.sh
    ENV TEZ_HOME /hadoop/usr/lib/tez
    ENV YARN_HOME /hadoop/usr/lib/hadoop-yarn
    ENV HADOOP_HOME /hadoop/usr/lib/hadoop
    ENV HDFS_HOME /hadoop/usr/lib/hadoop-hdfs
    ENV TEZ_HOME /hadoop/usr/lib/tez
    ENV HADOOP_CONF /hadoop/etc/hadoop/conf

    RUN curl -o ./apache-tez-0.9.2-bin.tar.gz https://archive.apache.org/dist/tez/0.9.2/apache-tez-0.9.2-bin.tar.gz && \
        curl -o ./hadoop-2.10.1.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz && \
        tar -xzf hadoop-2.10.1.tar.gz && \
        tar -xzf apache-tez-0.9.2-bin.tar.gz

    RUN mkdir -p $HADOOP_HOME/lib && \
        mkdir -p $TEZ_HOME && \
        mkdir -p $HADOOP_CONF && \
        mkdir -p $YARN_HOME && \
        mkdir -p $HDFS_HOME && \
        mkdir -p /tmp/tez-ui

    COPY hadoop-layout.sh $HADOOP_HOME/libexec/hadoop-layout.sh
    COPY yarn-site.xml .
    COPY pom.xml .

    RUN mvn dependency:copy-dependencies -DoutputDirectory=/tmp/tez-ui/ && \
        cp /tmp/tez-ui/joda-time-2.9.3.jar $HADOOP_HOME/lib/ && \
        cp /tmp/tez-ui/jetty-runner-*.jar $TEZ_HOME && \
        cp /tmp/tez-ui/tez-yarn-timeline-cache-plugin*.jar $TEZ_HOME

    COPY event-log-sync.sh .
    COPY entrypoint.sh /usr/bin/entrypoint.sh
    RUN chmod 744 $ENTRYPOINT

    ENTRYPOINT [ "/usr/bin/entrypoint.sh" ]
    EXPOSE 8088
    EXPOSE 8188
    EXPOSE 9999
    CMD tail -f /dev/null

    Your yarn-site.xml should look like the following.

    <?xml version="1.0"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    <configuration>
      <property>
        <name>yarn.timeline-service.hostname</name>
        <value>0.0.0.0</value>
      </property>
      <property>
        <name>yarn.timeline-service.bind-host</name>
        <value>0.0.0.0</value>
      </property>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>0.0.0.0</value>
      </property>
      <property>
        <name>yarn.resourcemanager.bind-host</name>
        <value>0.0.0.0</value>
      </property>
      <property>
        <name>yarn.timeline-service.http-cross-origin.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.resourcemanager.http-cross-origin.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.timeline-service.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.timeline-service.version</name>
        <value>1.5</value>
      </property>
      <property>
        <name>yarn.timeline-service.store-class</name>
        <value>org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore</value>
      </property>
      <property>
        <name>APPLICATION_ID</name>
        <value>APPLICATION_ID</value>
      </property>
      <property>
        <name>JOB_RUN_ID</name>
        <value>JOB_RUN_ID</value>
      </property>
      <property>
        <name>yarn.timeline-service.entity-group-fs-store.active-dir</name>
        <value>file:////tmp/timeline-data/${JOB_RUN_ID}/active</value>
      </property>
      <property>
        <name>yarn.timeline-service.entity-group-fs-store.done-dir</name>
        <value>file:////tmp/timeline-data/${JOB_RUN_ID}/done</value>
      </property>
      <property>
        <name>yarn.timeline-service.entity-group-fs-store.group-id-plugin-classes</name>
        <value>org.apache.tez.dag.history.logging.ats.TimelineCachePluginImpl</value>
      </property>
      <property>
        <name>yarn.timeline-service.entity-group-fs-store.summary-store</name>
        <value>org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore</value>
      </property>
      <property>
        <name>yarn.timeline-service.ttl-enable</name>
        <value>false</value>
      </property>
      <property>
        <name>yarn.timeline-service.entity-group-fs-store.scan-interval-seconds</name>
        <value>10</value>
      </property>
    </configuration>

    Your hadoop-layout.sh file should look like the following.

    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements. See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License. You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    HADOOP_COMMON_DIR="./"
    HADOOP_COMMON_LIB_JARS_DIR="lib"
    HADOOP_COMMON_LIB_NATIVE_DIR="lib/native"
    HDFS_DIR="./"
    HDFS_LIB_JARS_DIR="lib"
    YARN_DIR="./"
    YARN_LIB_JARS_DIR="lib"
    MAPRED_DIR="./"
    MAPRED_LIB_JARS_DIR="lib"
    HADOOP_LIBEXEC_DIR=${HADOOP_LIBEXEC_DIR:-"/usr/lib/hadoop/libexec"}
    HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop/conf"}
    HADOOP_COMMON_HOME=${HADOOP_COMMON_HOME:-"/usr/lib/hadoop"}
    HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-"/usr/lib/hadoop-hdfs"}
    HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-"/usr/lib/hadoop-mapreduce"}
    HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/lib/hadoop-yarn"}

    Your event-log-sync.sh should look like the following. This script regularly downloads timeline data from Amazon S3 to the container's local disk.

    job_path=/tmp/timeline-data/$JOB_RUN_ID
    mkdir -p $job_path/active
    mkdir -p $job_path/done
    while [[ true ]]; do
      aws s3 sync $S3_LOG_URI/applications/$APPLICATION_ID/jobs/$JOB_RUN_ID/timeline-data/active $job_path/active --exclude '*$folder$*'
      # Hack to move done events to active folder as ATS doesnt read done path
      aws s3 sync $S3_LOG_URI/applications/$APPLICATION_ID/jobs/$JOB_RUN_ID/timeline-data/done $job_path/done_events/ --exclude '*$folder$*'
      if [ -d "$job_path/done_events/*/*/*/*" ]; then
        rsync -r $job_path/done_events/*/*/*/* $job_path/active/
      fi
      echo " `date +%s` sleeping for 30 seconds"
      sleep 30s
    done

    Your entrypoint.sh should look like the following.

    #!/bin/bash
    if [ -z "$JOB_RUN_ID" ]; then
      echo "JOB_RUN_ID is not set"
      exit
    fi
    if [ -z "$APPLICATION_ID" ]; then
      echo "APPLICATION_ID is not set"
      exit
    fi
    if [ -z "$S3_LOG_URI" ]; then
      echo "S3_LOG_URI is not set"
      exit
    fi

    function cpp(){
      [ -d $2 ] || mkdir $2
      cp -r $1 $2
    }

    cpp "hadoop-2.10.1/share/hadoop/hdfs/*.jar" /hadoop/usr/lib/hadoop-hdfs/
    cpp "hadoop-2.10.1/share/hadoop/hdfs/*.jar" /hadoop/usr/lib/hadoop-hdfs/
    cpp "hadoop-2.10.1/share/hadoop/hdfs/lib/*.jar" /hadoop/usr/lib/hadoop-hdfs/
    cpp "hadoop-2.10.1/share/hadoop/common/lib/*.jar" /hadoop/usr/lib/hadoop/
    cpp "hadoop-2.10.1/share/hadoop/common/*.jar" /hadoop/usr/lib/hadoop/
    cpp "hadoop-2.10.1/share/hadoop/yarn/lib/*.jar" /hadoop/usr/lib/hadoop-yarn
    cpp "hadoop-2.10.1/share/hadoop/yarn/*.jar" /hadoop/usr/lib/hadoop-yarn
    cpp hadoop-2.10.1/share/hadoop/yarn/timelineservice/ /hadoop/usr/lib/hadoop-yarn/
    cpp "apache-tez-0.9.2-bin/*" /hadoop/usr/lib/tez/
    rm -rf $TEZ_HOME/lib/slf4j-log4j12-*
    cpp hadoop-2.10.1/bin/yarn /hadoop/usr/lib/hadoop-yarn/bin
    cp hadoop-2.10.1/etc/hadoop/* /hadoop/etc/hadoop/conf/
    cp yarn-site.xml /hadoop/etc/hadoop/conf/
    cpp hadoop-2.10.1/sbin/yarn-daemon.sh /hadoop/usr/lib/hadoop-yarn/sbin/
    cpp hadoop-2.10.1/libexec /hadoop/usr/lib/hadoop/

    bash event-log-sync.sh > event-log-sync.log &

    export HADOOP_BASE_PATH=/hadoop
    export HADOOP_COMMON_HOME=$HADOOP_BASE_PATH/usr/lib/hadoop
    export HADOOP_LIBEXEC_DIR=$HADOOP_BASE_PATH/usr/lib/hadoop/libexec
    export HADOOP_YARN_HOME=$HADOOP_BASE_PATH/usr/lib/hadoop-yarn
    export HADOOP_HDFS_HOME=$HADOOP_BASE_PATH/usr/lib/hadoop-hdfs
    export HADOOP_MAPRED_HOME=$HADOOP_BASE_PATH/usr/lib/hadoop
    export HADOOP_CONF_DIR=$HADOOP_BASE_PATH/etc/hadoop/conf
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_BASE_PATH/usr/lib/tez/*:$HADOOP_BASE_PATH/usr/lib/tez/lib/*:$HADOOP_BASE_PATH/usr/share/aws/aws-java-sdk/*
    export TEZ_HOME=$HADOOP_BASE_PATH/usr/lib/tez
    export PATH=$HADOOP_CLASSPATH:$PATH:$HADOOP_BASE_PATH/usr/lib/hadoop/bin
    export USER=`id -u -n`

    t=JOB_RUN_ID
    sed -i "s#<value>$t</value>#<value>$JOB_RUN_ID</value>#" /hadoop/etc/hadoop/conf/yarn-site.xml
    t=APPLICATION_ID
    sed -i "s#<value>$t</value>#<value>$APPLICATION_ID</value>#" /hadoop/etc/hadoop/conf/yarn-site.xml

    bash $HADOOP_BASE_PATH/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh start resourcemanager
    sleep 5s
    bash $HADOOP_BASE_PATH/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh start timelineserver

    rm -rf hadoop-2.10.1* apache-tez-0.9.2-bin*
    mkdir -p /hadoop/usr/lib/tez/logs/
    java -jar $TEZ_HOME/jetty-runner-*.jar --port 9999 --path /tez-ui/ $TEZ_HOME/tez-ui*.war > $TEZ_HOME/logs/tez-ui.log &
    echo "*****************************************************************"
    echo "Launching Tez UI. Access it using: http://localhost:9999/tez-ui/"
    echo "*****************************************************************"
    tail -f /dev/null
  2. Build the Docker image from the files in the local directory, and name it emr/tezui.

    docker build -t emr/tezui .
  3. Define your Amazon S3 log location as an environment variable. This is the base logUri that you configured in s3MonitoringConfiguration; the container scripts append the application ID, job run ID, and timeline-data path to it.

    export S3_LOG_URI="s3://DOC-EXAMPLE-BUCKET-LOGGING/logs"
  4. Define AWS access credentials as environment variables, and define the Region you’re running your job in.

    export AWS_ACCESS_KEY_ID=AKIAAAABBBCCCDDD
    export AWS_SECRET_ACCESS_KEY=abcd1234
    export AWS_REGION=us-east-1
  5. Define the application ID and job run ID that you want to monitor. EMR Serverless currently supports one job per Docker container.

    export APPLICATION_ID=<application_id>
    export JOB_RUN_ID=<job_run_id>
  6. Create and start the Docker container. In the following command, use the values that you defined in the previous steps.

    docker run -p 8088:8088 -p 8188:8188 -p 9999:9999 \
        -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_REGION \
        -e S3_LOG_URI -e JOB_RUN_ID -e APPLICATION_ID -it \
        emr/tezui
  7. Open http://localhost:9999/tez-ui/ in your browser to view the Tez UI locally.

  8. To view the next job run, you can stop the container and repeat step 5 and step 6 with the next job run's ID. If your Amazon S3 log path contains a job ID, remember to redefine your log location as well.
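
    For example, switching the container to the next job run might look like the following sketch. The container ID and job run ID are placeholder values.

    # Stop the running container (or press Ctrl+C in its terminal).
    docker ps
    docker stop container_id

    # Repeat steps 5 and 6 with the next job run's ID.
    export JOB_RUN_ID=<next_job_run_id>
    docker run -p 8088:8088 -p 8188:8188 -p 9999:9999 \
        -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_REGION \
        -e S3_LOG_URI -e JOB_RUN_ID -e APPLICATION_ID -it \
        emr/tezui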