Using native Python features Building a Python virtual environment Configuring PySpark jobs to use Python libraries

Using Python libraries with EMR Serverless

When you run PySpark jobs on Amazon EMR Serverless applications, you can package various Python libraries as dependencies. To do this, you can use native Python features, build a virtual environment, or directly configure your PySpark jobs to use Python libraries. This page covers each approach.

Using native Python features

When you set the following configuration, you can use PySpark to upload Python files (.py), zipped Python packages (.zip), and Egg files (.egg) to Spark executors.


--conf spark.submit.pyFiles=s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/<.py|.egg|.zip file>

For more details about how to use Python virtual environments for PySpark jobs, see Using PySpark Native Features.

Building a Python virtual environment

To package multiple Python libraries for a PySpark job, you can create isolated Python virtual environments.

To build the Python virtual environment, use the following commands. The example shown installs the packages scipy and matplotlib into a virtual environment package and copies the archive to an Amazon S3 location.

Important
You must run the following commands in a similar Amazon Linux 2 environment with the same version of Python as you use in EMR Serverless, that is, Python 3.7.10 for Amazon EMR release 6.6.0. You can find an example Dockerfile in the EMR Serverless Samples GitHub repository.
```
# initialize a python virtual environment
python3 -m venv pyspark_venvsource
source pyspark_venvsource/bin/activate

# optionally, ensure pip is up-to-date
pip3 install --upgrade pip

# install the python packages
pip3 install scipy
pip3 install matplotlib

# package the virtual environment into an archive
pip3 install venv-pack
venv-pack -f -o pyspark_venv.tar.gz

# copy the archive to an S3 location
aws s3 cp pyspark_venv.tar.gz s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/

# optionally, remove the virtual environment directory
rm -fr pyspark_venvsource
```
Submit the Spark job with your properties set to use the Python virtual environment.
```
--conf spark.archives=s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment 
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python 
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```
Note that if you don't override the original Python binary, the second configuration in the preceding sequence of settings will be --conf spark.executorEnv.PYSPARK_PYTHON=python.

For more on how to use Python virtual environments for PySpark jobs, see Using Virtualenv. For more examples of how to submit Spark jobs, see Using Spark configurations when you run EMR Serverless jobs.

Configuring PySpark jobs to use Python libraries

With Amazon EMR releases 6.12.0 and higher, you can directly configure EMR Serverless PySpark jobs to use popular data science Python libraries like pandas, NumPy, and PyArrow without any additional setup.

The following examples show how to package each Python library for a PySpark job.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Using Iceberg

Using different Python versions

Using Python libraries with EMR Serverless

Using native Python features

Building a Python virtual environment

Important

Configuring PySpark jobs to use Python libraries