Amazon Redshift integration for Apache Spark - Amazon Redshift

Amazon Redshift integration for Apache Spark

Apache Spark is a distributed processing framework and programming model that helps you do machine learning, stream processing, or graph analytics. Similar to Apache Hadoop, Spark is an open-source, distributed processing system commonly used for big data workloads. Spark has an optimized directed acyclic graph (DAG) execution engine and actively caches data in memory, which can boost performance, especially for certain algorithms and interactive queries.

This integration provides you with a Spark connector you can use to build Apache Spark applications that read data from and write data to Amazon Redshift and Amazon Redshift Serverless. These applications don't compromise on performance or on the transactional consistency of the data. The integration is automatically included in Amazon EMR and AWS Glue, so you can immediately run Apache Spark jobs that access and load data into Amazon Redshift as part of your data ingestion and transformation pipelines.

Currently, you can use Spark versions 3.3.0, 3.3.1, 3.3.2, and 3.4.0 with this integration.

This integration provides the following:

  • AWS Identity and Access Management (IAM) authentication. For more information, see Identity and access management in Amazon Redshift.

  • Predicate and query pushdown to improve performance.

  • Support for Amazon Redshift data types.

  • Connectivity to Amazon Redshift and Amazon Redshift Serverless.
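To make the capabilities listed above more concrete, the following is a minimal PySpark-style sketch of how a job might read from and write to Amazon Redshift through the connector. It is an illustration under stated assumptions, not a definitive implementation: the data source name is the one published by the open-source connector and may differ in your Amazon EMR or AWS Glue release, the helper names (redshift_options, read_from_redshift, write_to_redshift) are hypothetical, and the JDBC URL, S3 path, and role ARN you pass in are placeholders you supply.

```python
# Data source name of the open-source Spark connector for Redshift.
# Assumption: verify the name that applies to your EMR or Glue release.
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_options(jdbc_url, tempdir, iam_role_arn, table=None, query=None):
    """Build the connector options for one table or one query (hypothetical helper)."""
    if (table is None) == (query is None):
        raise ValueError("pass exactly one of 'table' or 'query'")
    options = {
        "url": jdbc_url,               # JDBC URL of the cluster or workgroup
        "tempdir": tempdir,            # S3 location used to stage data
        "aws_iam_role": iam_role_arn,  # role Redshift assumes for COPY and UNLOAD
    }
    if table is not None:
        options["dbtable"] = table
    else:
        options["query"] = query
    return options


def read_from_redshift(spark, options):
    """Return a DataFrame backed by the Redshift table or query in 'options'."""
    return spark.read.format(REDSHIFT_FORMAT).options(**options).load()


def write_to_redshift(df, options, mode="append"):
    """Write a DataFrame to the Redshift table named in 'options'."""
    df.write.format(REDSHIFT_FORMAT).options(**options).mode(mode).save()
```

In a Spark job you would call read_from_redshift(spark, options) with an existing SparkSession. Filters and projections applied to the returned DataFrame are candidates for pushdown to Amazon Redshift.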

Considerations and limitations when using the Spark connector

  • The tempdir URI points to an Amazon S3 location. This temporary directory isn't cleaned up automatically and can incur additional cost. We recommend using Amazon S3 lifecycle policies, described in the Amazon Simple Storage Service User Guide, to define retention rules for the Amazon S3 bucket.

  • By default, copies between Amazon S3 and Redshift don't work if the S3 bucket and Redshift cluster are in different AWS Regions. To use separate AWS Regions, set the tempdir_region parameter to the Region of the S3 bucket used for the tempdir.

  • Cross-Region writes between S3 and Redshift are not supported when writing Parquet data using the tempformat parameter.

  • We recommend using Amazon S3 server-side encryption to encrypt the Amazon S3 buckets used.

  • We recommend blocking public access to Amazon S3 buckets.

  • We recommend that the Amazon Redshift cluster not be publicly accessible.

  • We recommend turning on Amazon Redshift audit logging.

  • We recommend turning on Amazon Redshift at-rest encryption.

  • We recommend turning on SSL for the JDBC connection from Spark on Amazon EMR to Amazon Redshift.

  • We recommend passing an IAM role for Amazon Redshift authentication using the aws_iam_role parameter.
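Several of the considerations above translate directly into configuration. The sketch below shows one possible way to assemble connector options that set tempdir_region, pass an IAM role through aws_iam_role, and request SSL on the JDBC connection, along with a lifecycle rule that expires staged files under the tempdir prefix. The function names, the rule ID, and the default seven-day retention are hypothetical choices for illustration. Appending ssl=true as a URL query parameter is an assumption about the JDBC driver, so check the syntax your driver version expects.

```python
def recommended_options(jdbc_url, tempdir, tempdir_region, iam_role_arn):
    """Build connector options that follow the recommendations above (hypothetical helper)."""
    if not tempdir.startswith(("s3://", "s3a://", "s3n://")):
        raise ValueError("tempdir must point to an Amazon S3 location")
    # Assumption: the driver accepts ssl=true as a URL query parameter.
    if "ssl=" not in jdbc_url:
        separator = "&" if "?" in jdbc_url else "?"
        jdbc_url = f"{jdbc_url}{separator}ssl=true"
    return {
        "url": jdbc_url,
        "tempdir": tempdir,
        "tempdir_region": tempdir_region,  # Region of the bucket behind tempdir
        "aws_iam_role": iam_role_arn,      # IAM role instead of embedded keys
    }


def tempdir_lifecycle_configuration(prefix, days=7):
    """Return an S3 lifecycle configuration that expires staged files.

    The rule ID and the default retention period are illustrative only.
    """
    return {
        "Rules": [
            {
                "ID": "expire-spark-redshift-tempdir",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }
```

The dictionary returned by tempdir_lifecycle_configuration has the shape that the boto3 S3 client accepts as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration. Applying it requires credentials with permission to change the bucket's lifecycle settings.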