Troubleshooting AWS Batch - AWS Batch

Troubleshooting AWS Batch

You might need to troubleshoot issues that are related to your compute environments, job queues, job definitions, or jobs. This chapter describes how to troubleshoot and repair these issues in your AWS Batch environment.

AWS Batch uses IAM policies, roles, and permissions, and runs on Amazon EC2, Amazon ECS, AWS Fargate, and Amazon Elastic Kubernetes Service infrastructure. To troubleshoot issues that are related to these services, see the following:

AWS Batch

INVALID compute environment

It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an INVALID state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.

Incorrect role name or ARN

The most common cause for a compute environment to enter an INVALID state is that the AWS Batch service role or the Amazon EC2 Spot Fleet role has an incorrect name or Amazon Resource Name (ARN). This is more common with compute environments that are created using the AWS CLI or the AWS SDKs. When you create a compute environment in the AWS Management Console, AWS Batch helps you choose the correct service or Spot Fleet roles. However, suppose that you manually enter the name or the ARN and enter them incorrectly. Then, the resulting compute environment is also INVALID.

However, if you manually enter the name or ARN for an IAM in an AWS CLI command or your SDK code, AWS Batch can't validate the string. It must accept the bad value and attempt to create the environment. If AWS Batch fails to create the environment, the environment moves to an INVALID state, and you see the following errors.

For an invalid service role:

CLIENT_ERROR - Not authorized to perform sts:AssumeRole (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: dc0e2d28-2e99-11e7-b372-7fcc6fb65fe7)

For an invalid Spot Fleet role:

CLIENT_ERROR - Parameter: SpotFleetRequestConfig.IamFleetRole is invalid. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidSpotFleetRequestConfig; Request ID: 331205f0-5ae3-4cea-bac4-897769639f8d) Parameter: SpotFleetRequestConfig.IamFleetRole is invalid

One common cause for this issue is if you only specify the name of an IAM role when using the AWS CLI or the AWS SDKs, instead of the full Amazon Resource Name (ARN). This is because, depending on how you created the role, the ARN might contain a aws-service-role path prefix. For example, if you manually create the AWS Batch service role using the procedures in AWS Batch service IAM role, your service role ARN might look like the following:

arn:aws:iam::123456789012:role/AWSBatchServiceRole

However, if you created the service role as part of the console first run wizard today, your service role ARN might look like the following:

arn:aws:iam::123456789012:role/aws-service-role/AWSBatchServiceRole

When you only specify the name of an IAM role when using the AWS CLI or the AWS SDKs, AWS Batch assumes that your ARN doesn't use the aws-service-role path prefix. Because of this, we recommend that you specify the full ARN for your IAM roles when you create compute environments.

To repair a compute environment that's misconfigured this way, see Repairing an INVALID compute environment.

Repairing an INVALID compute environment

When you have a compute environment in an INVALID state, update it to repair the invalid parameter. For an Incorrect role name or ARN, update the compute environment using the correct service role.

To repair a misconfigured compute environment
  1. Open the AWS Batch console at https://console.aws.amazon.com/batch/.

  2. From the navigation bar, select the AWS Region to use.

  3. In the navigation pane, choose Compute environments.

  4. On the Compute environments page, select the radio button next to the compute environment to edit, and then choose Edit.

  5. On the Update compute environment page, for Service role, choose the IAM role to use with your compute environment. The AWS Batch console only displays roles that have the correct trust relationship for compute environments.

  6. Choose Save to update your compute environment.

Jobs stuck in a RUNNABLE status

Suppose that your compute environment contains compute resources, but your jobs don't progress beyond the RUNNABLE status. Then, there's likely something preventing the jobs from actually being placed on a compute resource. Here are some common causes for this issue:

The awslogs log driver isn't configured on your compute resources

AWS Batch jobs send their log information to CloudWatch Logs. To enable this, you must configure your compute resources to use the awslogs log driver. Suppose that you base your compute resource AMI off of the Amazon ECS optimized AMI (or Amazon Linux). Then, this driver is registered by default with the ecs-init package. Now suppose that you use a different base AMI. Then, you must verify that the awslogs log driver is specified as an available log driver with the ECS_AVAILABLE_LOGGING_DRIVERS environment variable when the Amazon ECS container agent is started. For more information, see Compute resource AMI specification and Creating a compute resource AMI.

Insufficient resources

If your job definitions specify more CPU or memory resources than your compute resources can allocate, then your jobs aren't ever placed. For example, suppose that your job specifies 4 GiB of memory, and your compute resources have less than that available. Then it's the case that the job can't be placed on those compute resources. In this case, you must reduce the specified memory in your job definition or add larger compute resources to your environment. Some memory is reserved for the Amazon ECS container agent and other critical system processes. For more information, see Compute Resource Memory Management.

No internet access for compute resources

Compute resources need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your compute resources having public IP addresses.

For more information about interface VPC endpoints, see Amazon ECS Interface VPC Endpoints (AWS PrivateLink) in the Amazon Elastic Container Service Developer Guide.

If you do not have an interface VPC endpoint configured and your compute resources do not have public IP addresses, then they must use network address translation (NAT) to provide this access. For more information, see NAT gateways in the Amazon VPC User Guide. For more information, see Create a VPC.

Amazon EC2 instance limit reached

The number of Amazon EC2 instances that your account can launch in an AWS Region is determined by your EC2 instance quota. Certain instance types also have a per-instance-type quota. For more information about your account's Amazon EC2 instance quota including how to request a limit increase, see Amazon EC2 Service Limits in the Amazon EC2 User Guide for Linux Instances.

Amazon ECS container agent isn't installed

The Amazon ECS container agent must be installed on the Amazon Machine Image (AMI) to let AWS Batch run jobs. The Amazon ECS container agent is installed by default on Amazon ECS optimized AMIs. For more information about the Amazon ECS container agent, see Amazon ECS container agent in the Amazon Elastic Container Service Developer Guide.

For more information, see Why is my AWS Batch job stuck in RUNNABLE status? in the AWS Knowledge Center.

Spot Instances not tagged on creation

Spot Instance tagging for AWS Batch compute resources is supported as of October 25, 2017. Before, the recommended IAM managed policy (AmazonEC2SpotFleetRole) for the Amazon EC2 Spot Fleet role didn't contain permissions to tag Spot Instances at launch. The new recommended IAM managed policy is called AmazonEC2SpotFleetTaggingRole. It supports tagging Spot Instances at launch.

To fix Spot Instance tagging on creation, follow the following procedure to apply the current recommended IAM managed policy to your Amazon EC2 Spot Fleet role. That way, any future Spot Instances that are created with that role have permissions to apply instance tags when they're created.

To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role
  1. Open the IAM console at https://console.aws.amazon.com/iam/.

  2. Choose Roles, and choose your Amazon EC2 Spot Fleet role.

  3. Choose Attach policy.

  4. Select the AmazonEC2SpotFleetTaggingRole and choose Attach policy.

  5. Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.

  6. Select the x to the right of the AmazonEC2SpotFleetRole policy, and choose Detach.

Spot Instances not scaling down

AWS Batch introduced the AWSServiceRoleForBatch service-linked role on March 10, 2021. If no role is specified in the serviceRole parameter of the compute environment, this service-linked role is used as the service role. However, suppose that the service-linked role is used in an EC2 Spot compute environment, but the Spot role used doesn't include the AmazonEC2SpotFleetTaggingRole managed policy. Then, the Spot Instance don't scale down. As a result, you will receive an error with the following message: "You are not authorized to perform this operation." Use the following steps to update the spot fleet role that you use in the spotIamFleetRole parameter. For more information, see Using service-linked roles and Creating a role to delegate permissions to an AWS Service in the IAM User Guide.

Attach AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role in the AWS Management Console

To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role
  1. Open the IAM console at https://console.aws.amazon.com/iam/.

  2. Choose Roles, and choose your Amazon EC2 Spot Fleet role.

  3. Choose Attach policy.

  4. Select the AmazonEC2SpotFleetTaggingRole and choose Attach policy.

  5. Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.

  6. Select the x to the right of the AmazonEC2SpotFleetRole policy, and choose Detach.

Attach AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role with the AWS CLI

The example commands assume that your Amazon EC2 Spot Fleet role is named AmazonEC2SpotFleetRole. If your role uses a different name, adjust the commands to match.

To attach the AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role
  1. To attach the AmazonEC2SpotFleetTaggingRole managed IAM policy to your AmazonEC2SpotFleetRole role, run the following command using the AWS CLI.

    $ aws iam attach-role-policy \ --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole \ --role-name AmazonEC2SpotFleetRole
  2. To detach the AmazonEC2SpotFleetRole managed IAM policy from your AmazonEC2SpotFleetRole role, run the following command using the AWS CLI.

    $ aws iam detach-role-policy \ --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetRole \ --role-name AmazonEC2SpotFleetRole

Can't retrieve Secrets Manager secrets

If you use an AMI with an Amazon ECS agent that's earlier than version 1.16.0-1, then you must use the Amazon ECS agent configuration variable ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true to use this feature. You can add it to the ./etc/ecs/ecs.config file to a new container instance when you create that instance. Or, you can add it to an existing instance. If you add it to an existing instance, you must restart the ECS agent after you add it. For more information, see Amazon ECS Container Agent Configuration in the Amazon Elastic Container Service Developer Guide.

Can't override job definition resource requirements

The memory and vCPU overrides that are specified in the memory and vcpus members of the containerOverrides structure, which passed to SubmitJob, can't override the memory and vCPU requirements that are specified in the resourceRequirements structure in the job definition.

If you try to override these resource requirements, you might see the following error message:

"This value was submitted in a deprecated key and may conflict with the value provided by the job definition's resource requirements."

To correct this, specify the memory and vCPU requirements in the resourceRequirements member of the containerOverrides. For example, if your memory and vCPU overrides are specified in the following lines.

"containerOverrides": { "memory": 8192, "vcpus": 4 }

Change them to the following:

"containerOverrides": { "resourceRequirements": [ { "type": "MEMORY", "value": "8192" }, { "type": "VCPU", "value": "4" } ], }

Do the same change to the memory and vCPU requirements that are specified in the containerProperties object in the job definition. For example, if your memory and vCPU requirements are specified in the following lines.

{ "containerProperties": { "memory": 4096, "vcpus": 2, }

Change them to the following:

"containerProperties": { "resourceRequirements": [ { "type": "MEMORY", "value": "4096" }, { "type": "VCPU", "value": "2" } ], }

AWS Batch on Amazon EKS

INVALID compute environment

It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an INVALID state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.

Unsupported Kubernetes version

If the compute environment was created on an unsupported Kubernetes version, AWS Batch on Amazon EKS changes the compute environment status to INVALID. You see an error set in the statusReason parameter that resembles the following:

reason=CLIENT_ERROR - ... EKS Cluster version [1.xx] is unsupported

To resolve this issue, delete the compute environment and then re-create it with a supported Kubernetes version.

If you specify an unsupported Kubernetes version when you create or update a compute environment using the CreateComputeEnvironment or UpdateComputeEnvironment API operation, you see an error message that resembles the following.

At least one imageKubernetesVersion in EC2Configuration is not supported.

To resolve this issue, specify a supported Kubernetes version when you use an API operation to create or update a compute environment.

AWS Batch on Amazon EKS currently supports the following Kubernetes versions:

  • 1.24

  • 1.23

  • 1.22

  • 1.21

  • 1.20

Note

We recommend Kubernetes version 1.22 or later.

Instance profile doesn't exist

If the specified instance profile does not exist, the AWS Batch on Amazon EKS compute environment status is changed to INVALID. You see an error set in the statusReason parameter that resembles the following:

CLIENT_ERROR - Instance profile arn:aws:iam::...:instance-profile/<name> does not exist

To resolve this issue, specify or create a working instance profile. For more information, see Amazon EKS node IAM role in the Amazon EKS User Guide.

Invalid Kubernetes namespace

If AWS Batch on Amazon EKS can't validate the namespace for the compute environment, the compute environment status is changed to INVALID. For example, this issue can occur if the namespace doesn't exist.

You see an error message set in the statusReason parameter that resembles the following:

CLIENT_ERROR - Unable to validate Kubernetes Namespace

This issue can occur if any of the following are true:

  • The Kubernetes namespace string in the CreateComputeEnvironment call doesn't exist. For more information, see CreateComputeEnvironment.

  • The required Role-Based Access Control (RBAC) permissions to manage the namespace are not configured correctly.

  • AWS Batch doesn't have access to the Amazon EKS Kubernetes API server endpoint.

To resolve this issue, see Verify that the aws-auth ConfigMap is configured correctly. For more information, see Getting started with AWS Batch on Amazon EKS .

Deleted compute environment

Suppose that you delete an Amazon EKS cluster before you delete the attached AWS Batch on Amazon EKS compute environment. Then, the compute environment status is changed to INVALID. In this scenario, the compute environment doesn't work properly if you re-create the Amazon EKS cluster with the same name.

To resolve this issue, delete and then re-create the AWS Batch on Amazon EKS compute environment.

Nodes don't join the Amazon EKS cluster

AWS Batch on Amazon EKS scales down a compute environment if it determines that not all nodes joined the Amazon EKS cluster. When AWS Batch on Amazon EKS scales down the compute environment, the compute environment status is changed to INVALID.

Note

AWS Batch doesn't change the compute environment status immediately so that you can debug the issue.

You see an error message set in the statusReason parameter that resembles ones of the following:

Your compute environment has been INVALIDATED and scaled down because none of the instances joined the underlying ECS Cluster. Common issues preventing instances joining are the following: VPC/Subnet configuration preventing communication to ECS, incorrect Instance Profile policy preventing authorization to ECS, or customized AMI or LaunchTemplate configurations affecting ECS agent.

Your compute environment has been INVALIDATED and scaled down because none of the nodes joined the underlying EKS Cluster. Common issues preventing nodes joining are the following: networking configuration preventing communication to EKS Cluster, incorrect EKS Instance Profile or Kubernetes RBAC policy preventing authorization to EKS Cluster, customized AMI or LaunchTemplate configurations affecting EKS/Kubernetes node bootstrap.

When using a default Amazon EKS AMI, the most common causes of this issue are the following:

AWS Batch on Amazon EKS job is stuck in RUNNABLE status

An aws-auth ConfigMap is automatically created and applied to your cluster when you create a managed node group or when you create a node group using eksctl. It's initially created to allow nodes to join your cluster. However, you also use the aws-authConfigMap to add role-based access control (RBAC) access to users and roles.

To verify that the aws-auth ConfigMap is configured correctly:

  1. Retrieve the mapped roles in the aws-auth ConfigMap:

    $ kubectl get configmap -n kube-system aws-auth -o yaml
  2. Verify that the roleARN is configured as follows:

    rolearn: arn:aws:iam::aws_account_number:role/AWSServiceRoleForBatch

    Note

    You can also review the Amazon EKS control plane logs. For more information, see Amazon EKS control plane logging in the Amazon EKS User Guide.

To resolve an issue where a job is stuck in a RUNNABLE status, we recommend that you use kubectl to re-apply the manifest. For more information, see Step 1: Preparing your EKS cluster for AWS Batch. Or, you can use kubectl to manually edit the aws-auth ConfigMap. For more information, see Enabling IAM user and role access to your cluster in the Amazon EKS User Guide.

Verify that the aws-auth ConfigMap is configured correctly

To verify that the aws-auth ConfigMap is configured correctly:

  1. Retrieve the mapped roles in the aws-auth ConfigMap:

    $ kubectl get configmap -n kube-system aws-auth -o yaml
  2. Verify that the roleARN is configured as follows:

    rolearn: arn:aws:iam::aws_account_number:role/AWSServiceRoleForBatch

    Note

    The path aws-service-role/batch.amazonaws.com/ has been removed from the ARN of the service-linked role. This is because of an issue with the aws-auth configuration map. For more information, see Roles with paths do not work when the path is included in their ARN in the aws-authconfigmap.

    Note

    You can also review the Amazon EKS control plane logs. For more information, see Amazon EKS control plane logging in the Amazon EKS User Guide.

To resolve an issue where a job is stuck in a RUNNABLE status, we recommend that you use kubectl to re-apply the manifest. For more information, see Step 1: Preparing your EKS cluster for AWS Batch. Or, you can use kubectl to manually edit the aws-auth ConfigMap. For more information, see Enabling IAM user and role access to your cluster in the Amazon EKS User Guide.

RBAC permissions or bindings aren't configured properly

If you experience any RBAC permissions or binding issues, verify that the aws-batch Kubernetes role can access the Kubernetes namespace:

$ kubectl get namespace namespace --as=aws-batch
$ kubectl auth can-i get ns --as=aws-batch

You can also use the kubectl describe command to view the authorizations for a cluster role or Kubernetes namespace.

$ kubectl describe clusterrole aws-batch-cluster-role

The following is example output.

Name: aws-batch-cluster-role Labels: <none> Annotations: <none> PolicyRule: Resources Non-Resource URLs Resource Names Verbs --------- ----------------- -------------- ----- configmaps [] [] [get list watch] nodes [] [] [get list watch] pods [] [] [get list watch] daemonsets.apps [] [] [get list watch] deployments.apps [] [] [get list watch] replicasets.apps [] [] [get list watch] statefulsets.apps [] [] [get list watch] clusterrolebindings.rbac.authorization.k8s.io [] [] [get list] clusterroles.rbac.authorization.k8s.io [] [] [get list] namespaces [] [] [get]
$ kubectl describe role aws-batch-compute-environment-role -n my-aws-batch-namespace

The following is example output.

Name: aws-batch-compute-environment-role Labels: <none> Annotations: <none> PolicyRule: Resources Non-Resource URLs Resource Names Verbs --------- ----------------- -------------- ----- pods [] [] [create get list watch delete patch] serviceaccounts [] [] [get list] rolebindings.rbac.authorization.k8s.io [] [] [get list] roles.rbac.authorization.k8s.io [] [] [get list]

To resolve this issue, re-apply the RBAC permissions and rolebinding commands. For more information, see Step 1: Preparing your EKS cluster for AWS Batch.