Amazon EMR Release Guide

Amazon EMR Sandbox Applications

With Amazon EMR sandbox applications, you have early access to new software for your cluster while those applications are still in development for a generally available release. Previously, bootstrap actions were the only way to install applications that were not fully supported on Amazon EMR. Sandbox application names are denoted with the suffix -sandbox; for example, if myApp were a sandbox application, it would be called myApp-sandbox.

You can install Amazon EMR sandbox applications on your cluster from the console, the AWS CLI, or the API. These applications are installed and configured using the native configuration API.
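For example, the following AWS CLI sketch launches a cluster with a sandbox application. This is a minimal example, not a definitive invocation; the cluster name, key pair name, instance type, and instance count are placeholders to adjust for your account.

# Launch a cluster with a sandbox application (here, Presto-Sandbox).
# my-sandbox-cluster, myKey, and the instance settings are placeholders.
aws emr create-cluster --name "my-sandbox-cluster" \
  --release-label emr-4.7.2 \
  --applications Name=Hadoop Name=Presto-Sandbox \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-type m3.xlarge \
  --instance-count 3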

For most sandbox applications, documentation is very limited until they are fully supported.

Note

Installation may take longer for a sandbox application than for a fully supported application.

Oozie (Sandbox)

Use the Apache Oozie Workflow Scheduler to manage and coordinate Hadoop jobs.

Release Information

Application: Oozie-Sandbox 4.2.0

Amazon EMR Release Label: emr-4.7.2

Components installed with this application: emrfs, emr-ddb, emr-goodies, emr-kinesis, emr-s3-dist-cp, hadoop-client, hadoop-mapred, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, oozie-client, oozie-server


For more information about Apache Oozie, see http://oozie.apache.org/.

Note

Oozie examples are not installed by default. To install the examples, SSH to the master node of the cluster and run install-oozie-examples.
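For example, a minimal sketch of those two steps; mykeypair.pem and master-public-dns-name are placeholders for your key pair file and your cluster's master public DNS name:

# Connect to the master node; the key pair file and DNS name are placeholders.
ssh -i ~/mykeypair.pem hadoop@master-public-dns-name
# On the master node, install the Oozie examples.
install-oozie-examples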

Presto (Sandbox)

Use Presto as a fast SQL query engine for large data sources.

Release Information

Application: Presto-Sandbox 0.148

Amazon EMR Release Label: emr-4.7.2

Components installed with this application: emrfs, emr-goodies, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hive-client, hcatalog-server, mysql-server, presto-coordinator, presto-worker


For more information about Presto, go to https://prestodb.io/.
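As a quick check once the cluster is running, you can run a query from the master node with the Presto CLI. This is a sketch under two assumptions: that the coordinator listens on port 8889 (the Amazon EMR default) and that the hive catalog points at your metastore.

# Run an ad hoc query against the hive catalog from the master node.
# Port 8889 is assumed to be the Presto coordinator port on Amazon EMR.
presto-cli --server localhost:8889 --catalog hive --schema default --execute "SHOW TABLES;"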

Note

  • Certain Presto properties cannot be configured directly with the configuration API. You can configure log.properties and config.properties (see the sketch following this note). However, the following files cannot be configured:

    • node.properties

    • jvm.config

    For more information about these configuration files, go to the Presto documentation.

  • Presto is not configured to use EMRFS; it uses PrestoFS instead.
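As a hedged sketch of supplying config.properties settings at cluster creation through a configuration classification: the classification name presto-config and the query.max-memory value below are assumptions to verify against the configuration classifications for your release.

# Launch a cluster with a Presto config.properties override.
# presto-config and query.max-memory=20GB are illustrative assumptions;
# myKey and the instance settings are placeholders.
aws emr create-cluster --release-label emr-4.7.2 \
  --applications Name=Presto-Sandbox \
  --configurations '[{"Classification":"presto-config","Properties":{"query.max-memory":"20GB"}}]' \
  --use-default-roles --ec2-attributes KeyName=myKey \
  --instance-type m3.xlarge --instance-count 3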

Sqoop (Sandbox)

Sqoop is a tool for transferring data between Amazon S3, Hadoop, HDFS, and relational databases.

Release Information

Application: Sqoop-Sandbox 1.4.6

Amazon EMR Release Label: emr-4.7.2

Components installed with this application: emrfs, emr-ddb, emr-goodies, hadoop-client, hadoop-mapred, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, mysql-server, sqoop-client


By default, Sqoop has MariaDB and PostgreSQL drivers installed. The PostgreSQL driver installed for Sqoop works only with PostgreSQL 8.4. To install an alternate set of JDBC connectors for Sqoop, install them in /usr/lib/sqoop/lib, as sketched below.
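A minimal sketch of installing a connector on the master node; my-jdbc-driver.jar is a placeholder for the driver JAR you download for your database:

# Copy a downloaded JDBC driver into Sqoop's library directory.
# my-jdbc-driver.jar is a placeholder name.
sudo cp my-jdbc-driver.jar /usr/lib/sqoop/lib/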

For the databases that Sqoop supports, see http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_supported_databases. If your JDBC connection string does not match one in that list, you must specify a driver.

For example, you can export to an Amazon Redshift database table with the following command (for JDBC 4.1):

sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver com.amazon.redshift.jdbc41.Driver --username master --password Mymasterpass1

You can use both the MariaDB and MySQL connection strings, but if you specify the MariaDB connection string, you need to specify the driver:

sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1

If you are using Secure Sockets Layer (SSL) encryption to access your database, you need to use a JDBC URI like the one in the following Sqoop export example. Because the URI contains ampersands, quote it so that the shell does not interpret them:

sqoop export --connect "jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true" --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1

For more information about SSL encryption in RDS, see Using SSL to Encrypt a Connection to a DB Instance in the Amazon Relational Database Service User Guide.

For more information, see the Apache Sqoop documentation.

Apache Zeppelin (Sandbox)

Use Apache Zeppelin as a notebook for interactive data exploration.

Release Information

Application: Zeppelin-Sandbox 0.5.6

Amazon EMR Release Label: emr-4.7.2

Components installed with this application: emrfs, emr-goodies, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, spark-client, spark-history-server, spark-on-yarn, spark-yarn-slave, zeppelin-server


For more information about Apache Zeppelin, go to https://zeppelin.incubator.apache.org/.

Note

  • Connect to Zeppelin using the same SSH tunneling method that you use to connect to other web servers on the master node. The Zeppelin server is found at port 8890; a port forwarding sketch follows this note.

  • Zeppelin does not use some of the settings defined in your cluster’s spark-defaults.conf configuration file (though it instructs YARN to allocate executors dynamically if you have enabled that setting). You must set executor settings (such as memory and cores) on the Interpreter tab and then restart the interpreter for them to be used.
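A minimal local port forwarding sketch; mykeypair.pem and master-public-dns-name are placeholders for your key pair file and the master node's public DNS name:

# Forward local port 8890 to the Zeppelin server on the master node.
ssh -i ~/mykeypair.pem -N -L 8890:localhost:8890 hadoop@master-public-dns-name
# Then open http://localhost:8890 in your browser to reach the Zeppelin UI.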