Configuring settings for interactive endpoints
Monitoring Spark jobs
To monitor and troubleshoot failures, configure your interactive endpoints so that jobs initiated with the endpoint can send log information to Amazon S3, Amazon CloudWatch Logs, or both. The following sections describe how to send Spark application logs to Amazon S3 for the Spark jobs that you launch with Amazon EMR on EKS interactive endpoints.
Configure IAM policy for Amazon S3 logs
Before your kernels can send log data to Amazon S3, the permissions policy for the job execution role must include the following permissions. Replace DOC-EXAMPLE-BUCKET-LOGGING with the name of your logging bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::DOC-EXAMPLE-BUCKET-LOGGING",
        "arn:aws:s3:::DOC-EXAMPLE-BUCKET-LOGGING/*"
      ]
    }
  ]
}
Note
Amazon EMR on EKS can also create an S3 bucket. If an S3 bucket is not available, include the s3:CreateBucket permission in the IAM policy.
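If you manage roles programmatically, the policy document above can also be generated instead of hand-edited. The following is a minimal sketch, assuming only the permissions listed above; the function name is illustrative, not part of any AWS SDK:

```python
import json

def s3_logging_policy(bucket):
    """Build the IAM policy document that lets a job execution role
    send interactive-endpoint logs to the given S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket itself (ListBucket)
                    f"arn:aws:s3:::{bucket}/*",    # objects in the bucket
                ],
            }
        ],
    }

# Serialize for use with, for example, an IAM put-role-policy call.
policy_json = json.dumps(s3_logging_policy("DOC-EXAMPLE-BUCKET-LOGGING"))
```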
After you've given your execution role the permissions it needs to send logs to the S3 bucket, your log data is sent to the following Amazon S3 locations. This happens when s3MonitoringConfiguration is passed in the monitoringConfiguration section of a create-managed-endpoint request.

-
Driver logs – logUri/virtual-cluster-id/endpoints/endpoint-id/containers/spark-application-id/spark-application-id-driver/(stderr.gz/stdout.gz)
-
Executor logs – logUri/virtual-cluster-id/endpoints/endpoint-id/containers/spark-application-id/executor-pod-name-exec-<Number>/(stderr.gz/stdout.gz)
Note
Amazon EMR on EKS doesn't upload the endpoint logs to your S3 bucket.
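Following the driver log location documented above, a small helper can compute where a given driver's logs land. This is an illustrative sketch; the virtual cluster, endpoint, and application IDs below are hypothetical placeholders:

```python
def driver_log_uri(log_uri, virtual_cluster_id, endpoint_id,
                   spark_application_id, stream="stderr"):
    """Build the S3 location of an interactive-endpoint driver log,
    per the documented pattern:
    logUri/virtual-cluster-id/endpoints/endpoint-id/containers/
        spark-application-id/spark-application-id-driver/stderr.gz
    """
    return (
        f"{log_uri.rstrip('/')}/{virtual_cluster_id}/endpoints/{endpoint_id}"
        f"/containers/{spark_application_id}/{spark_application_id}-driver/{stream}.gz"
    )

# Hypothetical IDs for illustration only.
uri = driver_log_uri(
    "s3://DOC-EXAMPLE-BUCKET-LOGGING/logs",
    "vc-123", "ep-456", "spark-000001",
)
```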
Specifying custom pod templates with interactive endpoints
You can create interactive endpoints where you specify custom pod templates for drivers and executors. Pod templates are specifications that determine how to run each pod. You can use pod template files to define the configurations of driver or executor pods that Spark configurations don't support. Pod templates are currently supported in Amazon EMR releases 6.3.0 and later.
For more information about pod templates, see Using pod templates in the Amazon EMR on EKS Development Guide.
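A pod template file is an ordinary Kubernetes pod spec. The fragment below is a hypothetical minimal driver template; the label and node selector values are placeholders for illustration, and the fields Amazon EMR on EKS supports are listed in Using pod templates:

```yaml
# Hypothetical minimal driver pod template (for example, path/to/driver/template.yaml).
# All values below are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  labels:
    team: example-team
spec:
  nodeSelector:
    node-type: example-driver-nodes
```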
The following example shows how to create an interactive endpoint with pod templates:
aws emr-containers create-managed-endpoint \
  --type JUPYTER_ENTERPRISE_GATEWAY \
  --virtual-cluster-id virtual-cluster-id \
  --name example-endpoint-name \
  --execution-role-arn arn:aws:iam::aws-account-id:role/EKSClusterRole \
  --release-label emr-6.9.0-latest \
  --configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.driver.podTemplateFile": "path/to/driver/template.yaml",
          "spark.kubernetes.executor.podTemplateFile": "path/to/executor/template.yaml"
        }
      }
    ]
  }'
Deploying a JEG pod to a node group
JEG (Jupyter Enterprise Gateway) pod placement is a feature that allows you to deploy an
interactive endpoint on a specific node group. With this feature, you can configure settings
such as instance type
for the interactive endpoint.
Associating a JEG pod to a managed node group
The following configuration property allows you to specify the name of a managed node group on your Amazon EKS cluster where the JEG pod will be deployed.
//payload
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "endpoint-configuration",
      "properties": {
        "managed-nodegroup-name": "NodeGroupName"
      }
    }
  ]
}'
A node group must have the Kubernetes label for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName attached to all nodes that are part of the node group. To list all nodes of a node group that have this tag, use the following command:
kubectl get nodes --show-labels | grep for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName
If the output of the command above doesn't return nodes that are part of your managed node group, then there are no nodes in the node group that have the for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName Kubernetes label attached. In this case, follow the steps below to attach that label to the nodes in your node group.
-
Use the following command to add the for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName Kubernetes label to all nodes in the managed node group NodeGroupName:
kubectl label nodes --selector eks:nodegroup-name=NodeGroupName for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName
-
Verify that the nodes were labeled correctly using the following command:
kubectl get nodes --show-labels | grep for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName
A managed node group must be associated with an Amazon EKS cluster’s security group, which
is usually the case if you created your cluster and managed node group using
eksctl
. You can verify this in the AWS console using the following
steps.
-
Go to your cluster in the Amazon EKS console.
-
Go to the networking tab of your cluster and note down the cluster security group.
-
Go to the compute tab of your cluster and click on the managed node group name.
-
Under the Details tab of the managed node group, verify that the cluster security group that you noted previously is listed under Security groups.
If the managed node group is not attached to the Amazon EKS cluster security group, you need to attach the for-use-with-emr-containers-managed-endpoint-sg=ClusterName/NodeGroupName tag to the node group security group. Use the steps below to attach this tag.
-
Go to the Amazon EC2 console and choose Security groups in the left navigation pane.
-
Select your managed node group's security group by selecting its checkbox.
-
Under the Tags tab, use the Manage tags button to add the tag for-use-with-emr-containers-managed-endpoint-sg=ClusterName/NodeGroupName.
Associating a JEG pod to a self-managed node group
The following configuration property allows you to specify the name of a self-managed or unmanaged node group on the Amazon EKS cluster where the JEG pod will be deployed.
//payload
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "endpoint-configuration",
      "properties": {
        "self-managed-nodegroup-name": "NodeGroupName"
      }
    }
  ]
}'
The node group must have the for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName Kubernetes label attached to all nodes that are part of the node group. To list all the nodes of a node group that have this tag, use the following command:
kubectl get nodes --show-labels | grep for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName
If the output of the command above doesn't return nodes that are part of your self-managed node group, then there are no nodes in the node group that have the for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName Kubernetes label attached. In this case, follow the steps below to attach that label to the nodes in your node group.
-
If you created the self-managed node group using eksctl, use the following command to add the for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName Kubernetes label to all nodes in the self-managed node group NodeGroupName at once:
kubectl label nodes --selector alpha.eksctl.io/nodegroup-name=NodeGroupName for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName
If you didn't use eksctl to create the self-managed node group, replace the selector in the command above with a different Kubernetes label that is attached to all the nodes of the node group.
-
Use the following command to verify that the nodes were labeled correctly:
kubectl get nodes --show-labels | grep for-use-with-emr-containers-managed-endpoint-ng=NodeGroupName
The security group for the self-managed node group must have the for-use-with-emr-containers-managed-endpoint-sg=ClusterName/NodeGroupName tag attached. Use the following steps to attach the tag to the security group from the AWS Management Console.
-
Navigate to the Amazon EC2 console. Select Security groups in the left navigation pane.
-
Select the checkbox next to the security group for your self-managed node group.
-
Under the Tags tab, use the Manage tags button to add the tag for-use-with-emr-containers-managed-endpoint-sg=ClusterName/NodeGroupName. Replace ClusterName and NodeGroupName with appropriate values.
Associating a JEG pod to a managed node group with On-Demand instances
You can also define additional labels, known as Kubernetes label selectors, to specify additional constraints or restrictions to run an interactive endpoint on a given node or node group. The following example shows how to use On-Demand Amazon EC2 instances for a JEG pod.
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "endpoint-configuration",
      "properties": {
        "managed-nodegroup-name": "NodeGroupName",
        "node-labels": "eks.amazonaws.com/capacityType:ON_DEMAND"
      }
    }
  ]
}'
Note
You can only use the node-labels property together with either the managed-nodegroup-name or the self-managed-nodegroup-name property.
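When assembling the endpoint configuration programmatically, the rule in the note above can be enforced with a small helper. This is an illustrative sketch, not part of any AWS SDK; the function name and arguments are hypothetical:

```python
def endpoint_configuration(managed_nodegroup=None,
                           self_managed_nodegroup=None,
                           node_labels=None):
    """Build the endpoint-configuration overrides, enforcing the documented
    rule that node-labels requires a managed or self-managed node group name."""
    props = {}
    if managed_nodegroup:
        props["managed-nodegroup-name"] = managed_nodegroup
    if self_managed_nodegroup:
        props["self-managed-nodegroup-name"] = self_managed_nodegroup
    if node_labels:
        if not (managed_nodegroup or self_managed_nodegroup):
            raise ValueError(
                "node-labels can only be used with a managed-nodegroup-name "
                "or self-managed-nodegroup-name property"
            )
        props["node-labels"] = node_labels
    return {
        "applicationConfiguration": [
            {"classification": "endpoint-configuration", "properties": props}
        ]
    }

# Reproduces the On-Demand example above.
cfg = endpoint_configuration(
    managed_nodegroup="NodeGroupName",
    node_labels="eks.amazonaws.com/capacityType:ON_DEMAND",
)
```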
Jupyter Enterprise Gateway (JEG) configuration options
Amazon EMR on EKS uses Jupyter Enterprise Gateway (JEG) to turn on interactive endpoints. You can set the following values for the allow-listed JEG configurations when you create the endpoint.
-
RemoteMappingKernelManager.cull_idle_timeout – Timeout in seconds (integer), after which a kernel is considered idle and ready to be culled. Values of 0 or lower deactivate culling. Short timeouts might result in kernels being culled for users with poor network connections.
-
RemoteMappingKernelManager.cull_interval – The interval in seconds (integer) on which to check for idle kernels that exceed the cull timeout value.
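These settings are supplied through the configuration overrides when you create the endpoint. The sketch below builds such a payload; note that the "jeg-config" classification name used here is an assumption not confirmed by this document, so verify the allow-listed classification against the current service documentation:

```python
import json

overrides = {
    "applicationConfiguration": [
        {
            # ASSUMPTION: "jeg-config" as the classification name; confirm
            # against the Amazon EMR on EKS documentation before use.
            "classification": "jeg-config",
            "properties": {
                # Cull kernels idle longer than 1 hour, checking every 5 minutes.
                "RemoteMappingKernelManager.cull_idle_timeout": "3600",
                "RemoteMappingKernelManager.cull_interval": "300",
            },
        }
    ]
}

# Serialize for the --configuration-overrides argument.
payload = json.dumps(overrides)
```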
Modifying PySpark session parameters
Starting with Amazon EMR on EKS release 6.9.0, in Amazon EMR Studio you can adjust the Spark
configuration associated with a PySpark session by executing the %%configure
magic command in the EMR notebook cell.
The following example shows a sample payload that you can use to modify memory, cores, and other properties for the Spark driver and executor. For the conf settings, you can configure any Spark configuration mentioned in the Apache Spark configuration documentation.

%%configure -f
{
  "driverMemory": "16G",
  "driverCores": 4,
  "executorMemory": "32G",
  "executorCores": 2,
  "conf": {
    "spark.dynamicAllocation.maxExecutors": 10,
    "spark.dynamicAllocation.minExecutors": 1
  }
}
The following example shows a sample payload that you can use to add files, pyFiles, and jar dependencies to a Spark runtime.

%%configure -f
{
  "files": "s3://test-bucket-emr-eks/sample_file.txt",
  "pyFiles": "path-to-python-files",
  "jars": "path-to-jars"
}
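Because %%configure expects a valid JSON body, a quick local check can catch a missing comma or colon before you restart the Spark session. This sketch validates the driver and executor settings shown above with Python's standard json module:

```python
import json

# The %%configure payload must be valid JSON; json.loads raises
# json.JSONDecodeError (a ValueError) on malformed input.
payload = """
{
  "driverMemory": "16G",
  "driverCores": 4,
  "executorMemory": "32G",
  "executorCores": 2,
  "conf": {
    "spark.dynamicAllocation.maxExecutors": 10,
    "spark.dynamicAllocation.minExecutors": 1
  }
}
"""

settings = json.loads(payload)
```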
Custom kernel image with interactive endpoint
To ensure that you have the correct dependencies for your application when you run interactive workloads from Amazon EMR Studio, you can customize Docker images for interactive endpoints and run customized base kernel images. To create an interactive endpoint and connect it with a custom Docker image, perform the following steps.
Note
You can only override base images. You can't add new kernel image types.
-
Create and publish a customized Docker image. The base image contains the Spark runtime and the notebook kernels that run with it. To create the image, you can follow steps 1 through 4 in How to customize Docker images. In step 1, the base image URI in your Docker file must use notebook-spark in place of spark.
ECR-registry-account.dkr.ecr.Region.amazonaws.com/notebook-spark/container-image-tag
For more information on how to select AWS Regions and container image tags, see How to select a base image URI.
-
Create an interactive endpoint that can be used with the custom image.
-
Create a JSON file named custom-image-managed-endpoint.json with the following contents. This example uses Amazon EMR release 6.9.0.

{
  "name": "endpoint-name",
  "virtualClusterId": "virtual-cluster-id",
  "type": "JUPYTER_ENTERPRISE_GATEWAY",
  "releaseLabel": "emr-6.9.0-latest",
  "executionRoleArn": "execution-role-arn",
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "jupyter-kernel-overrides",
        "configurations": [
          {
            "classification": "python3",
            "properties": {
              "container-image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/custom-notebook-python:latest"
            }
          },
          {
            "classification": "spark-python-kubernetes",
            "properties": {
              "container-image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/custom-notebook-spark:latest"
            }
          }
        ]
      }
    ]
  }
}
-
Create an interactive endpoint with the configurations specified in the JSON file as shown in the following example. For more information, see Create an interactive endpoint with the create-managed-endpoint command.
aws emr-containers create-managed-endpoint --cli-input-json file://custom-image-managed-endpoint.json
-
Connect to the interactive endpoint via EMR Studio. For more information and steps to complete, see Connecting from Studio
in the Amazon EMR on EKS section of the AWS Workshop Studio docs.