SageMaker Spark for Python (PySpark) examples - Amazon SageMaker

SageMaker Spark for Python (PySpark) examples

Amazon SageMaker provides an Apache Spark Python library (SageMaker PySpark) that you can use to integrate your Apache Spark applications with SageMaker. For example, you might use Apache Spark for data preprocessing and SageMaker for model training and hosting. For information about the SageMaker Apache Spark library, see Use Apache Spark with Amazon SageMaker.

Download PySpark

You can download the source code for both Python Spark (PySpark) and Scala libraries from the SageMaker Spark GitHub repository.

To install the SageMaker Spark library, use any of the following options. For more detailed instructions, visit SageMaker PySpark.

  • Install using pip:

    pip install sagemaker_pyspark
  • Install from the source:

    git clone git@github.com:aws/sagemaker-spark.git
    cd sagemaker-spark/sagemaker-pyspark-sdk
    python setup.py install
  • You can also create a new notebook in a notebook instance that uses either the Sparkmagic (PySpark) or the Sparkmagic (PySpark3) kernel and connect to a remote Amazon EMR cluster.

    Note

    The Amazon EMR cluster must be configured with an IAM role that has the AmazonSageMakerFullAccess policy attached. For information about configuring roles for an EMR cluster, see Configure IAM Roles for Amazon EMR Permissions to AWS Services in the Amazon EMR Management Guide.
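After installing with pip or from source, you may want to confirm that the library is importable and see how it is typically wired into a Spark session. The following is a minimal sketch, not an official procedure: the helper names is_installed and build_spark_session are hypothetical, and the session builder assumes that pyspark, a Java runtime, and sagemaker_pyspark are all available locally. It uses sagemaker_pyspark.classpath_jars() to place the bundled JAR files on the Spark driver classpath.

```python
import importlib.util
import os


def is_installed(package):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


def build_spark_session():
    """Create a SparkSession with the SageMaker Spark JARs on the driver classpath.

    Assumes pyspark, sagemaker_pyspark, and a Java runtime are installed.
    """
    # Imported lazily so that is_installed() works even when neither
    # package is present.
    from pyspark.sql import SparkSession
    import sagemaker_pyspark

    # classpath_jars() returns the paths of the JAR files shipped with
    # the sagemaker_pyspark package.
    classpath = os.pathsep.join(sagemaker_pyspark.classpath_jars())
    return (
        SparkSession.builder
        .config("spark.driver.extraClassPath", classpath)
        .getOrCreate()
    )


# Report the installation status without starting Spark.
if is_installed("sagemaker_pyspark"):
    print("sagemaker_pyspark is installed")
else:
    print("sagemaker_pyspark is not installed; try: pip install sagemaker_pyspark")
```

If the check reports that the library is installed, calling build_spark_session() should return a session that can load the SageMaker Spark classes. On a remote Amazon EMR cluster reached through a Sparkmagic kernel, the session is managed for you, so this local setup would not apply.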

PySpark examples

For examples on using SageMaker PySpark, see:

To run the notebooks on a notebook instance, see Example Notebooks. To run the notebooks in Studio, see Create or Open an Amazon SageMaker Studio Classic Notebook.