Configuring and managing pre-initialized capacity - Amazon EMR

Amazon EMR Serverless is in preview release and is subject to change. To use EMR Serverless in preview, follow the sign up steps at https://pages.awscloud.com/EMR-Serverless-Preview.html. The only Region that EMR Serverless currently supports is us-east-1, so make sure to set all region parameters to this value. All Amazon S3 buckets used with EMR Serverless must also be created in us-east-1.

Configuring and managing pre-initialized capacity

EMR Serverless provides an optional feature that keeps drivers and workers pre-initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity. You configure it for each application by setting the application's initialCapacity parameter to the number of workers you want to pre-initialize. Because pre-initialized workers let jobs start immediately, the feature is well suited to iterative applications and time-sensitive jobs.

When a job runs, it first uses any workers from initialCapacity that are available, that is, not already in use by previously submitted jobs. If those workers are in use by other jobs, or if the job requires more resources than initialCapacity provides, additional workers are requested and acquired, up to the maximum resource limits set for the application. When a job finishes, its workers are released and the resources available to the application return to initialCapacity. An application maintains its initialCapacity even after jobs finish running. Resources beyond initialCapacity are released as soon as they are no longer required to run jobs.
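The allocation order described above can be sketched as a small shell function. This is only an illustrative model, not EMR Serverless code: the function name split_worker_request and its three arguments are hypothetical, and it counts workers as uniform units, whereas the service expresses its maximum limits in vCPU and memory.

```shell
#!/bin/sh
# Hypothetical model of how a job's worker request is split.
# Arguments (all illustrative):
#   $1 = workers the job needs
#   $2 = pre-initialized workers that are currently idle
#   $3 = additional workers the application may still acquire
#        before it reaches its maximum capacity
# Prints: "<from_pool> <on_demand> <unmet>"
split_worker_request() {
  needed=$1; idle=$2; headroom=$3

  # Idle pre-initialized workers are used first.
  if [ "$needed" -le "$idle" ]; then
    from_pool=$needed
  else
    from_pool=$idle
  fi

  # Anything left over is acquired on demand, capped by the headroom.
  remaining=$((needed - from_pool))
  if [ "$remaining" -le "$headroom" ]; then
    on_demand=$remaining
  else
    on_demand=$headroom
  fi

  unmet=$((remaining - on_demand))
  echo "$from_pool $on_demand $unmet"
}

split_worker_request 20 50 100   # prints "20 0 0": fits in the warm pool
split_worker_request 80 50 100   # prints "50 30 0": 30 workers acquired on demand
```

In the second call, only the 30 on-demand workers are beyond initialCapacity, so under this model they are the ones released when the job finishes.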

Note

For this preview release, you must manually stop an application to decommission the pre-initialized workers.

Pre-initialized capacity is available and ready to use when the application has started. It is decommissioned when the application is stopped. An application moves to the STARTED state only after the requested pre-initialized capacity has been created and is ready to use. For as long as the application is in the STARTED state, EMR Serverless ensures that the pre-initialized capacity is either available for use or in use by jobs or interactive workloads. Capacity is replenished for released or failed containers to maintain the number of workers specified in the initialCapacity parameter. An application with no pre-initialized capacity can transition immediately from CREATED to STARTED.

You can modify the initialCapacity counts and specify compute configurations such as vCPU, memory, and disk for each worker. Modifications are allowed only when the application is in the CREATED or STOPPED state.
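A script that changes capacity settings could guard on the application state first. The following is a minimal sketch under that assumption: the helper can_modify_capacity and the app_state variable are hypothetical, and the state value is hard-coded here rather than read from the service.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) only for the states in which
# capacity changes are allowed according to the text above.
can_modify_capacity() {
  case "$1" in
    CREATED|STOPPED) return 0 ;;
    *)               return 1 ;;
  esac
}

app_state="STARTED"   # illustrative value; a real script would look this up
if can_modify_capacity "$app_state"; then
  echo "ok to modify initial capacity"
else
  echo "stop the application first (state: $app_state)"
fi
```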

Customizing pre-initialized capacity for specific big data frameworks

You can further customize pre-initialized capacity to suit workloads running on specific big data frameworks. For example, when running Apache Spark, you can specify how many workers start as drivers and how many start as executors. Similarly, when you use Apache Hive, you can specify how many workers start as Hive drivers, and how many are used to run Tez tasks.

Configuring an application running Apache Hive with pre-initialized capacity

The following API request creates an application running Apache Hive based on Amazon EMR release emr-5.34.0-preview. The application starts with 5 pre-initialized Hive drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized Tez task workers, each with 4 vCPU and 8 GB of memory. Hive queries that run on this application use the pre-initialized workers first and start executing immediately. If all of the pre-initialized workers are busy and more Hive jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.

aws emr-serverless create-application \
    --type HIVE \
    --name <my_application_name> \
    --release-label emr-5.34.0-preview \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 5,
            "resourceConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB"
            }
        },
        "TEZ_TASK": {
            "workerCount": 50,
            "resourceConfiguration": {
                "cpu": "4vCPU",
                "memory": "8GB"
            }
        }
    }' \
    --maximum-capacity '{
        "cpu": "400vCPU",
        "memory": "1024GB"
    }'
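As a quick sanity check, the footprint of the pre-initialized workers in the command above can be added up and compared with the maximum capacity. The sketch below copies its numbers from the --initial-capacity and --maximum-capacity arguments. It assumes that pre-initialized workers count toward the application's maximum, which this page does not state explicitly, and all variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the --initial-capacity and --maximum-capacity arguments above.
driver_count=5; driver_vcpu=2; driver_mem_gb=4
task_count=50;  task_vcpu=4;   task_mem_gb=8
max_vcpu=400;   max_mem_gb=1024

# Total resources held by the warm pool.
total_vcpu=$((driver_count * driver_vcpu + task_count * task_vcpu))
total_mem_gb=$((driver_count * driver_mem_gb + task_count * task_mem_gb))

# ASSUMPTION: pre-initialized workers count against the maximum capacity.
echo "pre-initialized: ${total_vcpu} vCPU, ${total_mem_gb} GB"
echo "headroom: $((max_vcpu - total_vcpu)) vCPU, $((max_mem_gb - total_mem_gb)) GB"
```

With these values the warm pool holds 210 vCPU and 420 GB, leaving 190 vCPU and 604 GB for workers acquired on demand under the stated assumption.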

Configuring an application running Apache Spark with pre-initialized capacity

The following API request creates an application running Apache Spark 3.1 based on Amazon EMR release 6.5. The application starts with 5 pre-initialized Spark drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized executors, each with 4 vCPU and 8 GB of memory. When Spark jobs are run on this application, they first use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Spark jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.

Note

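Spark adds 10% overhead to the memory requested for drivers and executors. For jobs to use pre-initialized workers, the initial capacity memory configuration should be at least 10% more than the memory requested by the job. For example, a job that requests 4 GB per executor needs pre-initialized workers configured with at least 4.4 GB of memory.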

aws emr-serverless create-application \
    --type "SPARK" \
    --name "<my_application_name>" \
    --release-label "emr-6.5.0-preview" \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 5,
            "resourceConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB"
            }
        },
        "EXECUTOR": {
            "workerCount": 50,
            "resourceConfiguration": {
                "cpu": "4vCPU",
                "memory": "8GB"
            }
        }
    }' \
    --maximum-capacity '{
        "cpu": "400vCPU",
        "memory": "1024GB"
    }'
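The 10% overhead rule from the note can be turned into a quick calculation of how much memory a job can request and still fit on the workers configured above. This is a rough sketch: the helper name max_job_memory_mb is hypothetical, it assumes 1 GB = 1024 MB, and it uses integer arithmetic, so results are rounded down.

```shell
#!/bin/sh
# Hypothetical helper: largest memory request (in MB) such that the request
# plus 10% overhead still fits in a pre-initialized worker of the given size.
# ASSUMPTION: overhead is exactly 10% of the requested memory, as in the note.
max_job_memory_mb() {
  worker_mb=$1
  echo $(( worker_mb * 100 / 110 ))
}

max_job_memory_mb 4096   # driver worker configured with 4GB, prints 3723
max_job_memory_mb 8192   # executor worker configured with 8GB, prints 7447
```

Under these assumptions, a job that sets spark.executor.memory to 7g would fit on the 8 GB executors (7 GB plus 10% is 7.7 GB), while a job that requests 8g would not.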