Installing and using kernels and libraries

Each EMR notebook comes with a set of pre-installed libraries and kernels. You can install additional libraries and kernels in an Amazon EMR cluster if the cluster has access to the repository where the kernels and libraries are located. For example, for clusters in private subnets, you might need to configure network address translation (NAT) and provide a path for the cluster to access the public PyPI repository to install a library. For more information about configuring external access for different network configurations, see Scenarios and examples in the Amazon VPC User Guide.
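
For example, if the cluster's private subnet already has an associated NAT gateway, the required route resembles the following sketch. This is a hypothetical AWS CLI command with placeholder route table and NAT gateway IDs; your VPC configuration may differ.

aws ec2 create-route --route-table-id rtb-0123456789abcdef0 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0123456789abcdef0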

Installing kernels and Python libraries on a cluster master node

With Amazon EMR release version 5.30.0 and later, excluding 6.0.0, you can install additional Python libraries and kernels on the master node of the cluster. After installation, these kernels and libraries are available to any user running an EMR notebook attached to the cluster. Python libraries installed this way are available only to processes running on the master node. The libraries are not installed on core or task nodes and are not available to executors running on those nodes.

Note

For Amazon EMR versions 5.30.1, 5.31.0, and 6.1.0, you must take additional steps to install kernels and libraries on the master node of a cluster.

To enable the feature, do the following:

  1. Make sure that the permissions policy attached to the service role for EMR Notebooks allows the following action:

    elasticmapreduce:ListSteps

    For more information, see Service role for EMR Notebooks. A minimal sketch of a matching policy statement follows these steps.

  2. Use the AWS CLI to run a step on the cluster that sets up EMR Notebooks, as shown in the following example. You must use the step name EMRNotebooksSetup. Replace MyClusterID with the ID of your cluster and us-east-1 with the Region in which your cluster resides. For more information, see Adding steps to a cluster using the AWS CLI.

    aws emr add-steps --cluster-id MyClusterID --steps Type=CUSTOM_JAR,Name=EMRNotebooksSetup,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://awssupportdatasvcs.com/bootstrap-actions/EMRNotebooksSetup/emr-notebooks-setup.sh"]
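
The policy statement referenced in step 1 might look like the following minimal sketch. The broad Resource value is an assumption; scope it to your clusters as your security requirements dictate.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "elasticmapreduce:ListSteps",
            "Resource": "*"
        }
    ]
}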

You can install kernels and libraries using pip or conda in the /emr/notebook-env/bin directory on the master node.

You can run /emr/notebook-env/bin/pip as a terminal command from within a notebook cell when you are attached to the Python3 kernel. When you use the PySpark kernel, you can use the %%sh Spark magic to invoke pip.

%%sh
/emr/notebook-env/bin/pip install -U <library-name>
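
When you are attached to the Python3 kernel, one way to invoke the same pip executable from a cell is the standard Jupyter shell escape, as in the following sketch; the library name is a placeholder.

!/emr/notebook-env/bin/pip install <library-name>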

Using conda requires sudo access. You must connect to the master node using SSH and then run conda from the terminal. For more information, see Connect to the master node using SSH.
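
A connection sketch, assuming an EC2 key pair and the default hadoop user, looks like the following; the key file and public DNS name are placeholders.

ssh -i ~/mykeypair.pem hadoop@<master-public-dns-name>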

The following example demonstrates installing the Kotlin kernel using a terminal command while connected to the master node of a cluster:

sudo /emr/notebook-env/bin/conda install kotlin-jupyter-kernel -c jetbrains

Note

These instructions do not install kernel dependencies. If your kernel has third-party dependencies, you may need to take additional setup steps before you can use the kernel with your notebook.
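
To check which kernels are registered after an installation, one option is to list the kernel specs from the master node. This sketch assumes the jupyter command is available in the notebook environment's bin directory.

/emr/notebook-env/bin/jupyter kernelspec list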

Using notebook-scoped libraries

Notebook-scoped libraries are available on clusters created with Amazon EMR release version 5.26.0 or later. Notebook-scoped libraries are intended for use only with the PySpark kernel. Any user can install additional notebook-scoped libraries from within a notebook cell. These libraries are available only to that notebook user during a single notebook session. If other users need the same libraries, or if the same user needs them in a different session, the libraries must be reinstalled.

Considerations and limitations

Consider the following when using notebook-scoped libraries:

  • You can uninstall only the libraries that are installed with the install_pypi_package API. You cannot uninstall any libraries pre-installed on the cluster.

  • If a library is installed on the cluster and also installed as a notebook-scoped library with a different version, the notebook-scoped version overrides the cluster library version (see the sketch after this list).
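
As a sketch of this override behavior, the following hypothetical cell installs a pinned notebook-scoped version of a library that may also be pre-installed on the cluster; the package name and version are placeholders.

sc.install_pypi_package("pandas==1.0.5")
import pandas
print(pandas.__version__)  # prints the notebook-scoped version for this session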

Working with notebook-scoped libraries

To install libraries, your Amazon EMR cluster must have access to the PyPI repository where the libraries are located.

The following examples demonstrate simple commands to list, install, and uninstall libraries from within a notebook cell using the PySpark kernel and APIs. For additional examples, see the Install Python libraries on a running cluster with EMR Notebooks post on the AWS Big Data Blog.

Example – Listing current libraries

The following command lists the Python packages available for the current Spark notebook session, including both libraries installed on the cluster and notebook-scoped libraries.

sc.list_packages()

Example – Installing the Celery library

The following command installs the Celery library as a notebook-scoped library.

sc.install_pypi_package("celery")

After installing the library, the following command confirms that the library is available on the Spark driver and executors.

import celery
sc.range(1,10000,1,100).map(lambda x: celery.__version__).collect()

Example – Installing the Arrow library, specifying the version and repository

The following command installs the Arrow library as a notebook-scoped library, specifying the library version and the repository URL.

sc.install_pypi_package("arrow==0.14.0", "https://pypi.org/simple")

Example – Uninstalling a library

The following command uninstalls the Arrow library, removing it as a notebook-scoped library from the current session.

sc.uninstall_package("arrow")
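
To verify the removal, you can list the packages available to the session again:

sc.list_packages()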