What is Amazon EMR Serverless? - Amazon EMR

Amazon EMR Serverless is in preview release and is subject to change. To use EMR Serverless in preview, follow the sign up steps at https://pages.awscloud.com/EMR-Serverless-Preview.html. The only Region that EMR Serverless currently supports is us-east-1, so make sure to set all region parameters to this value. All Amazon S3 buckets used with EMR Serverless must also be created in us-east-1.

What is Amazon EMR Serverless?

Amazon EMR Serverless is a new deployment option for Amazon EMR. EMR Serverless provides a serverless runtime environment that simplifies running analytics applications using the latest open source frameworks such as Apache Spark and Apache Hive. With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications with these frameworks.

EMR Serverless helps you avoid over- or under-provisioning resources for your data processing jobs. EMR Serverless automatically determines the resources required by the applications, acquires these resources to process your jobs, and relinquishes them when the jobs finish. For use cases where applications require a response within seconds, such as interactive data analysis, you can pre-initialize required resources during application creation.

With EMR Serverless, you continue to get the benefits of Amazon EMR, such as open source compatibility, concurrency, and a performance-optimized runtime for popular frameworks.

EMR Serverless is suitable for customers who want to run applications built on open source frameworks without managing the underlying infrastructure. It offers easy provisioning, quick job startup, automatic capacity management, and simple cost controls.

Concepts

This section describes EMR Serverless terms and concepts.

Release version

An Amazon EMR release is a set of open-source applications from the big data ecosystem. Each release comprises different big data applications, components, and features that you select to have EMR Serverless deploy and configure to run your applications. When creating an application, you must specify its release version: choose the Amazon EMR release version that includes the open source framework version you want to use in your application.

Application

With EMR Serverless, you can create one or more EMR Serverless applications that use open source analytics frameworks. To create an application, you must specify the following attributes:

  • The Amazon EMR release version for the open source framework version you want to use. To determine your release version, see Release versions.

  • The specific runtime that you want your application to use, such as Apache Spark or Apache Hive.

After you create an application, you can schedule data processing jobs or interactive requests to your application.
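The two required attributes above can be sketched as a small request builder. This is a minimal illustration, not the service's API: the key names (name, releaseLabel, type), the upper-case runtime spellings, the application name, and the release label value are all assumptions made for the example. See Release versions for actual release labels.

```python
from dataclasses import dataclass

# Runtimes named in the text; the upper-case spellings are an assumption.
SUPPORTED_TYPES = {"SPARK", "HIVE"}


@dataclass(frozen=True)
class ApplicationSpec:
    """Holds the attributes the text says an application must specify."""

    name: str           # hypothetical label, used only for illustration
    release_label: str  # Amazon EMR release version (placeholder value below)
    app_type: str       # runtime, such as Apache Spark or Apache Hive

    def to_request(self) -> dict:
        """Build a request-shaped dictionary (key names are assumptions)."""
        runtime = self.app_type.upper()
        if runtime not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported runtime: {self.app_type!r}")
        if not self.release_label:
            raise ValueError("a release version is required")
        return {
            "name": self.name,
            "releaseLabel": self.release_label,
            "type": runtime,
        }


# Placeholder release label; check Release versions for real values.
spark_app = ApplicationSpec("nightly-etl", "emr-6.5.0-preview", "spark").to_request()
print(spark_app)
```

Validating the runtime and release version before sending anything keeps obvious mistakes local. A real client call would take the place of the print statement.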

Each EMR Serverless application is strictly isolated from other applications and runs within a secure Amazon Virtual Private Cloud (VPC). Additionally, you can use IAM policies to define which IAM users and roles can access the application. You can also specify limits to control and track usage costs incurred by the application.

Consider creating multiple applications for the following scenarios:

  • Using different open source frameworks

  • Using different versions of open source frameworks for different use cases

  • Performing A/B testing when upgrading from one version to another

  • Maintaining separate logical environments for test and production scenarios

  • Providing separate logical environments for different teams with independent cost controls and usage tracking

  • Separating different line-of-business applications

EMR Serverless is a Regional service that simplifies running workloads across multiple Availability Zones within a Region. To learn more about using applications with EMR Serverless, see Interacting with your application.

Job run

A job run is a request submitted to an EMR Serverless application that is executed asynchronously and tracked through completion. Examples of jobs include a HiveQL query submitted to an Apache Hive application or a PySpark data processing script submitted to an Apache Spark application. When submitting a job, you must specify an execution role, defined in IAM, that the job uses to access AWS resources, such as Amazon S3 objects. You can submit multiple job run requests to an application, and each job run can use a different execution role to access AWS resources. EMR Serverless starts executing jobs as soon as they are received and runs multiple job requests concurrently. To learn more about running jobs, see Running jobs.
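The relationship between an application, its job runs, and their execution roles can be sketched as follows. This is an illustration under stated assumptions: the key names (applicationId, executionRoleArn, jobDriver, sparkSubmit, entryPoint, entryPointArguments) are assumed, and the application ID, account number, role names, and S3 paths are placeholders.

```python
def build_job_run(application_id, execution_role_arn, entry_point, args=()):
    """Build a job-run request for a Spark application (key names are assumptions)."""
    # The text requires an IAM execution role for every job run.
    if not execution_role_arn.startswith("arn:aws:iam::"):
        raise ValueError("execution role must be an IAM role ARN")
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "entryPointArguments": list(args),
            }
        },
    }


APP_ID = "example-application-id"  # placeholder

# Two job runs against the same application, each with its own execution role.
etl_run = build_job_run(
    APP_ID,
    "arn:aws:iam::123456789012:role/example-etl-role",
    "s3://example-bucket/scripts/etl.py",
    args=["--date", "2021-12-01"],
)
report_run = build_job_run(
    APP_ID,
    "arn:aws:iam::123456789012:role/example-report-role",
    "s3://example-bucket/scripts/report.py",
)

for run in (etl_run, report_run):
    print(run["applicationId"], run["executionRoleArn"])
```

Because the role is attached to the job run rather than to the application, each job can be limited to the AWS resources it actually needs.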

Workers

An EMR Serverless application internally uses workers to execute your workloads. The default sizes of these workers are based on your application type and Amazon EMR release version. You can override these sizes when scheduling a job run.

When a job is submitted, EMR Serverless computes the resources needed for the job and schedules workers. EMR Serverless breaks down your workloads into tasks, downloads images, provisions and sets up workers, and decommissions them when the job finishes. EMR Serverless automatically scales workers up or down depending on the workload and parallelism required at every stage of the job, removing the need for you to estimate the number of workers required to run your workloads.
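To make the idea of per-stage scaling concrete, here is a toy model. It is not the algorithm EMR Serverless uses, which the text does not describe. It assumes, purely for illustration, that each worker runs a fixed number of tasks in parallel and that the worker count follows the task count of each stage.

```python
import math


def workers_per_stage(stage_tasks, tasks_per_worker):
    """Toy model: workers needed for each stage at a fixed per-worker parallelism."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    return [math.ceil(tasks / tasks_per_worker) for tasks in stage_tasks]


def scaling_steps(plan):
    """Change in worker count when entering each stage, starting from zero."""
    steps = []
    current = 0
    for target in plan:
        steps.append(target - current)  # positive: scale up, negative: scale down
        current = target
    return steps


# Hypothetical job with three stages of decreasing parallelism.
plan = workers_per_stage([200, 40, 8], tasks_per_worker=4)
print(plan)                 # [50, 10, 2]
print(scaling_steps(plan))  # [50, -40, -8]
```

In this model, a fixed-size cluster would need 50 workers for the whole job even though the later stages use only a fraction of them. Automatic scaling is meant to avoid that kind of over-provisioning.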

Pre-initialized capacity

EMR Serverless provides a feature that keeps workers initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initial-capacity parameter of an application. Pre-initialized capacity allows jobs to start immediately, making it ideal for implementing iterative applications and time-sensitive jobs. To learn more about pre-initialized workers, see Configuring and managing pre-initialized capacity.
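A warm pool for a Spark application might be described with a structure like the one below. This is a sketch only: the text names the initial-capacity parameter but not its shape, so the worker-type keys (DRIVER, EXECUTOR), the field names (workerCount, workerConfiguration, cpu, memory), and the size strings are assumptions.

```python
def initial_capacity(driver_count, executor_count, cpu="4vCPU", memory="16GB"):
    """Describe a warm pool of pre-initialized workers (all key names are assumptions)."""

    def pool(count):
        if count < 0:
            raise ValueError("worker count cannot be negative")
        return {
            "workerCount": count,
            "workerConfiguration": {"cpu": cpu, "memory": memory},
        }

    return {"DRIVER": pool(driver_count), "EXECUTOR": pool(executor_count)}


def warm_workers(capacity):
    """Total number of workers kept initialized for the application."""
    return sum(entry["workerCount"] for entry in capacity.values())


# Hypothetical sizing for an interactive application: 1 driver, 10 executors.
capacity = initial_capacity(driver_count=1, executor_count=10)
print(warm_workers(capacity))  # 11
```

Sizing the pool involves a trade-off. More pre-initialized workers let more jobs start immediately, but they stay allocated between jobs.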