Using Python libraries with EMR Serverless

When you run PySpark jobs on Amazon EMR Serverless applications, you can package various Python libraries as dependencies. To do this, you can use native Python features, build a virtual environment, or directly configure your PySpark jobs to use Python libraries. This page covers each approach.

Using native Python features

When you set the following configuration, PySpark uploads Python files (.py), zipped Python packages (.zip), and Egg files (.egg) to the Spark executors.

--conf spark.submit.pyFiles=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/<.py|.egg|.zip file>

For more details about how to use native Python features for PySpark jobs, see Using PySpark Native Features.
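If your dependencies live in a local package directory, you first need a .zip with the layout that import expects on the executors. The following is a minimal sketch of that packaging step using only the Python standard library; the helper name and paths are illustrative, not part of EMR Serverless:

```python
import os
import zipfile

def zip_py_package(package_dir: str, out_zip: str) -> None:
    """Zip a local Python package so the archive can be uploaded to S3
    and referenced via spark.submit.pyFiles (hypothetical helper)."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    path = os.path.join(root, name)
                    # Store paths relative to the package's parent
                    # directory so `import mypkg` works on the executors.
                    arcname = os.path.relpath(
                        path, os.path.dirname(package_dir)
                    )
                    zf.write(path, arcname)
```

After uploading the resulting archive with aws s3 cp, you would point spark.submit.pyFiles at its S3 URI as shown above.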

Building a Python virtual environment

To package multiple Python libraries for a PySpark job, you can create isolated Python virtual environments.

  1. To build the Python virtual environment, use the following commands. The example shown installs the packages scipy and matplotlib into a virtual environment package and copies the archive to an Amazon S3 location.


    You must run the following commands in an environment similar to Amazon Linux 2, with the same version of Python that you use in EMR Serverless (Python 3.7.10 for Amazon EMR release 6.6.0). You can find an example Dockerfile in the EMR Serverless Samples GitHub repository.

    # initialize a python virtual environment
    python3 -m venv pyspark_venvsource
    source pyspark_venvsource/bin/activate

    # optionally, ensure pip is up-to-date
    pip3 install --upgrade pip

    # install the python packages
    pip3 install scipy
    pip3 install matplotlib

    # package the virtual environment into an archive
    pip3 install venv-pack
    venv-pack -f -o pyspark_venv.tar.gz

    # copy the archive to an S3 location
    aws s3 cp pyspark_venv.tar.gz s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/

    # optionally, remove the virtual environment directory
    rm -fr pyspark_venvsource
  2. Submit the Spark job with your properties set to use the Python virtual environment.

    --conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment
    --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
    --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
    --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python

    Note that if you don't override the original Python binary, the last configuration in the preceding sequence of settings will be --conf spark.executorEnv.PYSPARK_PYTHON=python.

    For more on how to use Python virtual environments for PySpark jobs, see Using Virtualenv. For more examples of how to submit Spark jobs, see Spark jobs.
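Because the four properties above always follow the same pattern, it can be convenient to build the sparkSubmitParameters string programmatically before starting a job run. A minimal sketch, assuming a packed archive URI; the helper name is hypothetical:

```python
def venv_submit_params(archive_uri: str, env_name: str = "environment") -> str:
    """Build the Spark submit parameters for a packed virtual environment.

    archive_uri: S3 URI of the venv-pack archive.
    env_name: directory name the archive is unpacked into on the cluster.
    """
    python = f"./{env_name}/bin/python"
    return " ".join([
        # unpack the archive into ./<env_name> on driver and executors
        f"--conf spark.archives={archive_uri}#{env_name}",
        # point the driver and executors at the environment's interpreter
        f"--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON={python}",
        f"--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON={python}",
        f"--conf spark.executorEnv.PYSPARK_PYTHON={python}",
    ])
```

You could then pass the returned string as the sparkSubmitParameters of a job run, for example through the AWS CLI or an SDK.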

Configuring PySpark jobs to use Python libraries

With Amazon EMR releases 6.12.0 and higher, you can directly configure EMR Serverless PySpark jobs to use popular data science Python libraries like pandas, NumPy, and PyArrow without any additional setup.

The following examples show how to import each of these Python libraries in a PySpark job.

NumPy (version 1.21.6)

NumPy is a Python library for scientific computing that offers multidimensional arrays and operations for math, sorting, random simulation, and basic statistics. To use NumPy, run the following command:

import numpy
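To illustrate, a short sketch of the array operations NumPy provides; the values here are arbitrary examples:

```python
import numpy as np

# a 2 x 3 multidimensional array
a = np.arange(6).reshape(2, 3)

# basic math and statistics on the array
col_means = a.mean(axis=0)  # per-column means
total = a.sum()             # sum of all elements
```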

pandas (version 1.3.5)

pandas is a Python library that is built on top of NumPy. The pandas library provides data scientists with DataFrame data structures and data analysis tools. To use pandas, run the following command:

import pandas
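As a brief illustration of the DataFrame structure and analysis tools pandas provides (the column names and values below are arbitrary examples):

```python
import pandas as pd

# a small DataFrame, the core pandas data structure
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# simple data-analysis operations
x_total = df["x"].sum()
summary = df.describe()  # summary statistics per column
```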

PyArrow (version 12.0.1)

PyArrow is a Python library that manages in-memory columnar data for improved job performance. PyArrow is based on the Apache Arrow cross-language development specification, which is a standard way to represent and exchange data in a columnar format. To use PyArrow, run the following command:

import pyarrow