Using Python libraries with EMR Serverless
When you run PySpark jobs on Amazon EMR Serverless applications, you can package various Python libraries as dependencies. To do this, you can use native Python features, build a virtual environment, or directly configure your PySpark jobs to use Python libraries. This page covers each approach.
Using native Python features
When you set the following configuration, you can use PySpark to upload Python files (.py), zipped Python packages (.zip), and Egg files (.egg) to Spark executors.
--conf spark.submit.pyFiles=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/<.py|.egg|.zip file>
For more details about how to use native Python features for PySpark jobs, see Using PySpark Native Features.
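As an illustration of preparing a zipped Python package for spark.submit.pyFiles, the following is a minimal sketch using only the Python standard library. The helper names zip_package and py_files_conf and the package name mylib are hypothetical and not part of EMR Serverless or PySpark. Uploading the resulting archive to Amazon S3 is left out.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_package(package_dir: str, out_zip: str) -> List[str]:
    """Zip the .py files of a local package directory (hypothetical helper).

    Archive paths are stored relative to the package's parent directory,
    so the package name is the top-level folder inside the zip.
    """
    pkg = Path(package_dir).resolve()
    names = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(pkg.rglob("*.py")):
            arcname = path.relative_to(pkg.parent).as_posix()
            zf.write(path, arcname)
            names.append(arcname)
    return names


def py_files_conf(s3_uri: str) -> str:
    """Format the --conf flag for an already-uploaded .py, .zip, or .egg file."""
    return f"--conf spark.submit.pyFiles={s3_uri}"
```

For example, zip_package("mylib", "mylib.zip") writes the archive locally, and py_files_conf("s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/mylib.zip") returns the flag to pass once the archive has been copied to that location.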
Building a Python virtual environment
To package multiple Python libraries for a PySpark job, you can create isolated Python virtual environments.
- To build the Python virtual environment, use the following commands. The example shown installs the packages scipy and matplotlib into a virtual environment package and copies the archive to an Amazon S3 location.
  Important
  You must run the following commands in a similar Amazon Linux 2 environment with the same version of Python as you use in EMR Serverless, that is, Python 3.7.10 for Amazon EMR release 6.6.0. You can find an example Dockerfile in the EMR Serverless Samples GitHub repository.
  # initialize a python virtual environment
  python3 -m venv pyspark_venvsource
  source pyspark_venvsource/bin/activate
  # optionally, ensure pip is up-to-date
  pip3 install --upgrade pip
  # install the python packages
  pip3 install scipy
  pip3 install matplotlib
  # package the virtual environment into an archive
  pip3 install venv-pack
  venv-pack -f -o pyspark_venv.tar.gz
  # copy the archive to an S3 location
  aws s3 cp pyspark_venv.tar.gz s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/
  # optionally, remove the virtual environment directory
  rm -fr pyspark_venvsource
- Submit the Spark job with your properties set to use the Python virtual environment.
  --conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment
  --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
  --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
  Note that if you don't override the original Python binary, the second configuration in the preceding sequence of settings will be --conf spark.executorEnv.PYSPARK_PYTHON=python.
For more on how to use Python virtual environments for PySpark jobs, see Using Virtualenv. For more examples of how to submit Spark jobs, see Spark jobs.
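The virtual environment settings above can also be assembled programmatically. The following is a minimal sketch, assuming you submit jobs through an API client that accepts a job driver structure with a sparkSubmit entry (the dictionary here mirrors the shape of the jobDriver argument to start_job_run in the boto3 emr-serverless client). The helper names and the entry point path are hypothetical.

```python
def venv_submit_parameters(archive_s3_uri: str, alias: str = "environment") -> str:
    """Build the --conf flags that point the driver and executors at the
    Python binary inside the unpacked virtual environment archive."""
    python_bin = f"./{alias}/bin/python"
    confs = [
        # The "#alias" suffix names the directory the archive is unpacked into.
        f"spark.archives={archive_s3_uri}#{alias}",
        f"spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON={python_bin}",
        f"spark.emr-serverless.driverEnv.PYSPARK_PYTHON={python_bin}",
        f"spark.executorEnv.PYSPARK_PYTHON={python_bin}",
    ]
    return " ".join(f"--conf {c}" for c in confs)


def build_job_driver(entry_point: str, archive_s3_uri: str) -> dict:
    """Assemble a job driver dictionary (hypothetical helper); nothing is
    submitted here, the caller passes the result to its own client."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": venv_submit_parameters(archive_s3_uri),
        }
    }
```

For example, build_job_driver("s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/job.py", "s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz") returns a dictionary whose sparkSubmitParameters string contains the four settings listed above. Keeping the archive alias in one place avoids a mismatch between the name after # and the ./environment/bin/python paths.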
Configuring PySpark jobs to use Python libraries
With Amazon EMR releases 6.12.0 and higher, you can directly configure EMR Serverless PySpark jobs to use popular data science Python libraries like pandas.
The following examples show how to package each Python library for a PySpark job.