Troubleshooting AWS Batch
You might need to troubleshoot issues that are related to your compute environments, job queues, job definitions, or jobs. This chapter describes how to troubleshoot and resolve such issues in your AWS Batch environment.
AWS Batch uses IAM policies, roles, and permissions, and runs on Amazon EC2, Amazon ECS, AWS Fargate, and Amazon Elastic Kubernetes Service infrastructure. To troubleshoot issues that are related to these services, see the following:
- Troubleshooting IAM in the IAM User Guide
- Amazon ECS troubleshooting in the Amazon Elastic Container Service Developer Guide
- Amazon EKS troubleshooting in the Amazon EKS User Guide
- Troubleshoot EC2 instances in the Amazon EC2 User Guide for Linux Instances
Contents
- AWS Batch
- AWS Batch on Amazon EKS
AWS Batch
INVALID compute environment
It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an INVALID state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.
Incorrect role name or ARN
The most common cause for a compute environment to enter an INVALID state is that the AWS Batch service role or the Amazon EC2 Spot Fleet role has an incorrect name or Amazon Resource Name (ARN). This is more common with compute environments that are created using the AWS CLI or the AWS SDKs. When you create a compute environment in the AWS Management Console, AWS Batch helps you choose the correct service or Spot Fleet roles. If you instead enter the name or the ARN manually in the console and enter it incorrectly, the resulting compute environment is also INVALID.
Similarly, if you manually enter the name or ARN for an IAM resource in an AWS CLI command or your SDK code, AWS Batch can't validate the string. Instead, AWS Batch must accept the bad value and attempt to create the environment. If AWS Batch fails to create the environment, the environment moves to an INVALID state, and you see the following errors.
For an invalid service role:
CLIENT_ERROR - Not authorized to perform sts:AssumeRole (Service:
AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID:
dc0e2d28-2e99-11e7-b372-7fcc6fb65fe7)
For an invalid Spot Fleet role:
CLIENT_ERROR - Parameter: SpotFleetRequestConfig.IamFleetRole is invalid. (Service:
AmazonEC2; Status Code: 400; Error Code: InvalidSpotFleetRequestConfig; Request ID:
331205f0-5ae3-4cea-bac4-897769639f8d) Parameter: SpotFleetRequestConfig.IamFleetRole is
invalid
One common cause for this issue is that you specify only the name of an IAM role when using the AWS CLI or the AWS SDKs, instead of the full Amazon Resource Name (ARN). Depending on how you created the role, the ARN might contain an aws-service-role path prefix. For example, if you manually create the AWS Batch service role using the procedures in Using service-linked roles for AWS Batch, your service role ARN might look like the following.
arn:aws:iam::123456789012:role/AWSBatchServiceRole
However, if you created the service role as part of the console first run wizard today, your service role ARN might look like the following.
arn:aws:iam::123456789012:role/aws-service-role/AWSBatchServiceRole
This issue can also occur if you attach the AWS Batch service-level policy (AWSBatchServiceRole) to a non-service role. For example, you might receive an error message that resembles the following in this scenario:
CLIENT_ERROR - User: arn:aws:sts::account_number:assumed-role/batch-replacement-role/aws-batch is not authorized to perform: action on resource ...
To resolve this issue, do one of the following.
- Use an empty string for the service role when you create the AWS Batch compute environment.
- Specify the service role in the following format: arn:aws:iam::account_number:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch
When you only specify the name of an IAM role when using the AWS CLI or the AWS SDKs, AWS Batch assumes that your ARN doesn't use the aws-service-role path prefix. Because of this, we recommend that you specify the full ARN for your IAM roles when you create compute environments.
To repair a compute environment that's misconfigured this way, see Repairing an INVALID compute environment.
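As a rough illustration of the recommendation to pass full ARNs, the following sketch checks a role value before it's used in a request. The helper names is_full_role_arn and uses_service_role_path and the regular expression are assumptions made for this example and aren't part of AWS Batch. The pattern only covers the standard aws partition and the ARN shapes shown in this section.

```python
import re

# Hypothetical pattern for IAM role ARNs such as
# arn:aws:iam::123456789012:role/AWSBatchServiceRole, with or without a path.
# Assumption: only the standard "aws" partition is handled.
_ROLE_ARN = re.compile(r"^arn:aws:iam::\d{12}:role/([\w+=,.@-]+/)*[\w+=,.@-]+$")


def is_full_role_arn(value: str) -> bool:
    """Return True if value looks like a full IAM role ARN, not a bare name."""
    return bool(_ROLE_ARN.match(value))


def uses_service_role_path(value: str) -> bool:
    """Return True if the ARN carries the aws-service-role path prefix."""
    return is_full_role_arn(value) and ":role/aws-service-role/" in value
```

A bare name such as AWSBatchServiceRole fails the first check, which is a cue to look up and pass the full ARN instead.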
Repairing an INVALID compute environment
When you have a compute environment in an INVALID state, update it to repair the invalid parameter. For an incorrect role name or ARN, update the compute environment using the correct service role.
To repair a misconfigured compute environment
- Open the AWS Batch console at https://console.aws.amazon.com/batch/.
- From the navigation bar, select the AWS Region to use.
- In the navigation pane, choose Compute environments.
- On the Compute environments page, select the radio button next to the compute environment to edit, and then choose Edit.
- On the Update compute environment page, for Service role, choose the IAM role to use with your compute environment. The AWS Batch console only displays roles that have the correct trust relationship for compute environments.
- Choose Save to update your compute environment.
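The same repair can probably be scripted with an AWS SDK. The following is a minimal sketch using the boto3 update_compute_environment call. The function names and the compute environment name used in any call are hypothetical, and the second function assumes that boto3 is installed and that credentials are configured.

```python
def build_repair_request(compute_environment: str, service_role_arn: str) -> dict:
    """Build keyword arguments for the UpdateComputeEnvironment API operation."""
    if not service_role_arn.startswith("arn:"):
        # Mirrors the recommendation in this chapter to pass the full ARN.
        raise ValueError("pass the full role ARN, not only the role name")
    return {
        "computeEnvironment": compute_environment,
        "serviceRole": service_role_arn,
    }


def repair_compute_environment(compute_environment: str, service_role_arn: str) -> dict:
    """Send the update with boto3 (imported lazily; needs AWS credentials)."""
    import boto3  # assumption: boto3 is installed and configured

    batch = boto3.client("batch")
    return batch.update_compute_environment(
        **build_repair_request(compute_environment, service_role_arn)
    )
```

Keeping the request builder separate from the call makes it possible to check the arguments without contacting AWS.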
Jobs stuck in a RUNNABLE status
Suppose that your compute environment contains compute resources, but your jobs don't progress beyond the RUNNABLE status. Then, it's likely that something is preventing the jobs from being placed on a compute resource and causing your job queues to be blocked. Here's how to know if your job is waiting for its turn or stuck and blocking the queue.
If AWS Batch detects that you have a RUNNABLE job at the head and blocking the queue, you'll receive a blocked job queue event from Amazon CloudWatch Events with the reason. The same reason is also updated into the statusReason field as a part of ListJobs and DescribeJobs API calls.
Optionally, you can configure the jobStateTimeLimitActions parameter through the CreateJobQueue and UpdateJobQueue API actions.
Note
Currently, the only action you can use with jobStateTimeLimitActions.action is to cancel a job.
The jobStateTimeLimitActions parameter is used to specify a set of actions that AWS Batch performs on jobs in a specific state. You can set a time threshold in seconds through the maxTimeSeconds field.
When a job has been in a RUNNABLE state with the defined statusReason, AWS Batch performs the specified action after maxTimeSeconds have elapsed.
For example, you can set the jobStateTimeLimitActions parameter to wait up to 4 hours for any job in the RUNNABLE state that is waiting for sufficient capacity to become available. To do this, set statusReason to CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY and maxTimeSeconds to 14400 (4 hours × 3,600 seconds). After that time elapses, AWS Batch cancels the job and allows the next job to advance to the head of the job queue.
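To make the time conversion concrete, here is a small sketch that builds one jobStateTimeLimitActions entry. The helper name build_time_limit_action is made up for this example. The key names (reason, state, maxTimeSeconds, action) follow the API request shape as best understood here, where the status reason is passed in a field named reason, so check them against the current API reference before relying on them.

```python
SECONDS_PER_HOUR = 3600


def build_time_limit_action(reason: str, hours: float) -> dict:
    """Build one jobStateTimeLimitActions entry that cancels a blocked RUNNABLE job."""
    return {
        "reason": reason,
        "state": "RUNNABLE",
        "maxTimeSeconds": int(hours * SECONDS_PER_HOUR),
        "action": "CANCEL",  # the only action described in this section
    }


# Wait up to 4 hours for capacity before the job is canceled.
action = build_time_limit_action("CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY", 4)
print(action["maxTimeSeconds"])  # 14400
```

The resulting dictionary would be passed in a list, for example as jobStateTimeLimitActions=[action], to the CreateJobQueue or UpdateJobQueue operation.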
The following are the reasons that AWS Batch provides when it detects that a job queue is blocked. This list provides the messages returned from the ListJobs and DescribeJobs API actions. These are also the same values you can define for the jobStateTimeLimitActions.statusReason parameter.
- Reason: All connected compute environments have insufficient capacity errors. When requested, AWS Batch detects Amazon EC2 instances that experience insufficient capacity errors. Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue.
  - statusReason message while the job is stuck: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY - Service cannot fulfill the capacity requested for instance type [instanceTypeName]
  - reason used for jobStateTimeLimitActions: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY
  - statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY
  Note:
  - The AWS Batch service role requires the autoscaling:DescribeScalingActivities permission for this detection to work. If you use the AWSServiceRoleForBatch service-linked role (SLR) or the AWSBatchServiceRolePolicy managed policy, then you don't need to take any action because their permission policies are updated.
  - If you don't use the SLR or the managed policy, you must add the autoscaling:DescribeScalingActivities and ec2:DescribeSpotFleetRequestHistory permissions so that you can receive blocked job queue events and updated job status when in RUNNABLE. In addition, AWS Batch needs these permissions to perform cancellation actions through the jobStateTimeLimitActions parameter even if they are configured on the job queue.
  - In the case of a multi-node parallel (MNP) job, if the attached high-priority Amazon EC2 compute environment experiences insufficient capacity errors, it blocks the queue even if a lower-priority compute environment doesn't experience this error.
- Reason: All compute environments have a maxvCpus parameter that is smaller than the job requirements. Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue. Optionally, you can increase the maxvCpus parameter of the primary compute environment to meet the needs of the blocked job.
  - statusReason message while the job is stuck: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE - CE(s) associated with the job queue cannot meet the CPU requirement of the job.
  - reason used for jobStateTimeLimitActions: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE
  - statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE
- Reason: None of the compute environments have instances that meet the job requirements. When a job requests resources, AWS Batch detects that no attached compute environment is able to accommodate the incoming job. Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue. Optionally, you can redefine the compute environment's allowed instance types to add the necessary job resources.
  - statusReason message while the job is stuck: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT - The job resource requirement (vCPU/memory/GPU) is higher than that can be met by the CE(s) attached to the job queue.
  - reason used for jobStateTimeLimitActions: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT
  - statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT
- Reason: All compute environments have service role issues. To resolve this, compare your service role permissions to the AWS Batch managed service role permissions and address any gaps. It's a best practice to use the AWS Batch SLR for compute environments to avoid similar errors. Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue. However, unless you resolve the service role issue, the next job is likely to be blocked as well. It's best to manually investigate and resolve this issue.
  - statusReason message while the job is stuck: MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS – Batch service role has a permission issue.
  - reason used for jobStateTimeLimitActions: MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS
  - statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS
- Reason: All compute environments are invalid. For more information, see INVALID compute environment. Note: You can't configure a programmable action through the jobStateTimeLimitActions parameter to resolve this error.
  - statusReason message while the job is stuck: ACTION_REQUIRED - CE(s) associated with the job queue are invalid.
- Reason: AWS Batch has detected a blocked queue, but is unable to determine the reason. Note: You can't configure a programmable action through the jobStateTimeLimitActions parameter to resolve this error. For more information about troubleshooting, see Why is my AWS Batch job stuck in RUNNABLE on AWS in re:Post.
  - statusReason message while the job is stuck: UNDETERMINED - Batch job is blocked, root cause is undetermined.
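When you triage many blocked jobs, it can help to reduce each statusReason message to its reason code. The following sketch does that for the messages listed above and reports whether the code is one that accepts a jobStateTimeLimitActions entry. The function names and the ACTIONABLE_REASONS set are assumptions for this example, built only from the list in this section.

```python
import re

# Reason codes from the list above that can be used with jobStateTimeLimitActions.
ACTIONABLE_REASONS = {
    "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY",
    "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
    "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
    "MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS",
}


def reason_code(status_reason: str) -> str:
    """Return the code before the ' - ' (hyphen or en dash) separator."""
    return re.split(r"\s+[-\u2013]\s+", status_reason, maxsplit=1)[0].strip()


def can_auto_cancel(status_reason: str) -> bool:
    """True if the blocked-queue reason can be used with jobStateTimeLimitActions."""
    return reason_code(status_reason) in ACTIONABLE_REASONS
```

The input would typically be the statusReason value of a RUNNABLE job returned by ListJobs or DescribeJobs.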
If you didn't receive an event from CloudWatch Events, or you received the event with the undetermined reason, here are some common causes for this issue.
- The awslogs log driver isn't configured on your compute resources
AWS Batch jobs send their log information to CloudWatch Logs. To enable this, you must configure your compute resources to use the awslogs log driver. If you base your compute resource AMI on the Amazon ECS optimized AMI (or Amazon Linux), this driver is registered by default with the ecs-init package. If you use a different base AMI, you must verify that the awslogs log driver is specified as an available log driver with the ECS_AVAILABLE_LOGGING_DRIVERS environment variable when the Amazon ECS container agent is started. For more information, see Compute resource AMI specification and Creating a compute resource AMI.
- Insufficient resources
If your job definitions specify more CPU or memory resources than your compute resources can allocate, then your jobs are never placed. For example, suppose that your job specifies 4 GiB of memory, and your compute resources have less than that available. Then the job can't be placed on those compute resources. In this case, you must reduce the specified memory in your job definition or add larger compute resources to your environment. Some memory is reserved for the Amazon ECS container agent and other critical system processes. For more information, see Compute Resource Memory Management.
- No internet access for compute resources
Compute resources need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your compute resources having public IP addresses.
For more information about interface VPC endpoints, see Amazon ECS Interface VPC Endpoints (AWS PrivateLink) in the Amazon Elastic Container Service Developer Guide.
If you don't have an interface VPC endpoint configured and your compute resources don't have public IP addresses, then they must use network address translation (NAT) to provide this access. For more information, see NAT gateways in the Amazon VPC User Guide and Create a VPC.
- Amazon EC2 instance limit reached
The number of Amazon EC2 instances that your account can launch in an AWS Region is determined by your EC2 instance quota. Certain instance types also have a per-instance-type quota. For more information about your account's Amazon EC2 instance quota, including how to request a limit increase, see Amazon EC2 Service Limits in the Amazon EC2 User Guide for Linux Instances.
- Amazon ECS container agent isn't installed
The Amazon ECS container agent must be installed on the Amazon Machine Image (AMI) to let AWS Batch run jobs. The Amazon ECS container agent is installed by default on Amazon ECS optimized AMIs. For more information about the Amazon ECS container agent, see Amazon ECS container agent in the Amazon Elastic Container Service Developer Guide.
For more information, see Why is my AWS Batch job stuck in RUNNABLE status?
Spot Instances not tagged on creation
Spot Instance tagging for AWS Batch compute resources is supported as of October 25, 2017. Before that date, the recommended IAM managed policy (AmazonEC2SpotFleetRole) for the Amazon EC2 Spot Fleet role didn't contain permissions to tag Spot Instances at launch. The new recommended IAM managed policy is called AmazonEC2SpotFleetTaggingRole. It supports tagging Spot Instances at launch.
To fix Spot Instance tagging on creation, use the following procedure to apply the current recommended IAM managed policy to your Amazon EC2 Spot Fleet role. That way, any future Spot Instances that are created with that role have permissions to apply instance tags when they're created.
To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role
- Open the IAM console at https://console.aws.amazon.com/iam/.
- Choose Roles, and choose your Amazon EC2 Spot Fleet role.
- Choose Attach policy.
- Select the AmazonEC2SpotFleetTaggingRole and choose Attach policy.
- Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.
- Select the x to the right of the AmazonEC2SpotFleetRole policy, and choose Detach.
Spot Instances not scaling down
AWS Batch introduced the AWSServiceRoleForBatch service-linked role on March 10, 2021. If no role is specified in the serviceRole parameter of the compute environment, this service-linked role is used as the service role. However, suppose that the service-linked role is used in an EC2 Spot compute environment, but the Spot role used doesn't include the AmazonEC2SpotFleetTaggingRole managed policy. Then, the Spot Instance doesn't scale down. As a result, you receive an error with the following message: "You are not authorized to perform this operation." Use the following steps to update the Spot Fleet role that you use in the spotIamFleetRole parameter. For more information, see Using service-linked roles and Creating a role to delegate permissions to an AWS Service in the IAM User Guide.
Attach AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role in the AWS Management Console
To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role
- Open the IAM console at https://console.aws.amazon.com/iam/.
- Choose Roles, and choose your Amazon EC2 Spot Fleet role.
- Choose Attach policy.
- Select the AmazonEC2SpotFleetTaggingRole and choose Attach policy.
- Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.
- Select the x to the right of the AmazonEC2SpotFleetRole policy, and choose Detach.
Attach AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role with the AWS CLI
The example commands assume that your Amazon EC2 Spot Fleet role is named AmazonEC2SpotFleetRole. If your role uses a different name, adjust the commands to match.
To attach the AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role
- To attach the AmazonEC2SpotFleetTaggingRole managed IAM policy to your AmazonEC2SpotFleetRole role, run the following command using the AWS CLI.
  $ aws iam attach-role-policy \
      --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole \
      --role-name AmazonEC2SpotFleetRole
- To detach the AmazonEC2SpotFleetRole managed IAM policy from your AmazonEC2SpotFleetRole role, run the following command using the AWS CLI.
  $ aws iam detach-role-policy \
      --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetRole \
      --role-name AmazonEC2SpotFleetRole
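If you manage roles from SDK code instead of the AWS CLI, the two commands above can be approximated with boto3 as in the following sketch. The function names are hypothetical, the default role name simply repeats the assumption made for the CLI commands, and apply_policy_swap assumes that boto3 is installed and that credentials are configured.

```python
POLICY_PREFIX = "arn:aws:iam::aws:policy/service-role/"


def policy_swap_plan(role_name: str = "AmazonEC2SpotFleetRole") -> list:
    """Return (operation, kwargs) pairs: attach the tagging policy, then detach the old one."""
    return [
        ("attach_role_policy",
         {"RoleName": role_name,
          "PolicyArn": POLICY_PREFIX + "AmazonEC2SpotFleetTaggingRole"}),
        ("detach_role_policy",
         {"RoleName": role_name,
          "PolicyArn": POLICY_PREFIX + "AmazonEC2SpotFleetRole"}),
    ]


def apply_policy_swap(role_name: str = "AmazonEC2SpotFleetRole") -> None:
    """Run the plan with boto3 (imported lazily; needs AWS credentials)."""
    import boto3  # assumption: boto3 is installed and configured

    iam = boto3.client("iam")
    for operation, kwargs in policy_swap_plan(role_name):
        getattr(iam, operation)(**kwargs)
```

Attaching the new policy before detaching the old one keeps the role usable throughout the change.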
Can't retrieve Secrets Manager secrets
If you use an AMI with an Amazon ECS agent that's earlier than version 1.16.0-1, then you must use the Amazon ECS agent configuration variable ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true to use this feature. You can add it to the /etc/ecs/ecs.config file on a new container instance when you create that instance. Or, you can add it to an existing instance. If you add it to an existing instance, you must restart the ECS agent after you add it. For more information, see Amazon ECS Container Agent Configuration in the Amazon Elastic Container Service Developer Guide.
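As one way to add the variable without duplicating it, the following sketch rewrites the text of a KEY=VALUE configuration file so that the setting appears exactly once. The helper name ensure_setting and the sample file contents are hypothetical. Reading and writing /etc/ecs/ecs.config and restarting the agent are left out on purpose.

```python
def ensure_setting(config_text: str, key: str, value: str) -> str:
    """Return config_text with key=value set exactly once (KEY=VALUE lines)."""
    # Drop any existing assignment of the key, then append the new one.
    lines = [ln for ln in config_text.splitlines() if not ln.startswith(key + "=")]
    lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


updated = ensure_setting(
    "ECS_CLUSTER=my-cluster\n",  # hypothetical existing file contents
    "ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE",
    "true",
)
print(updated)
```

Running the helper again on its own output leaves the text unchanged, so it's safe to apply more than once.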
Can't override job definition resource requirements
The memory and vCPU overrides that are specified in the memory and vcpus members of the containerOverrides structure, which is passed to SubmitJob, can't override the memory and vCPU requirements that are specified in the resourceRequirements structure in the job definition.
If you try to override these resource requirements, you might see the following error message:
"This value was submitted in a deprecated key and may conflict with the value provided by the job definition's resource requirements."
To correct this, specify the memory and vCPU requirements in the resourceRequirements member of containerOverrides. For example, suppose that your memory and vCPU overrides are specified in the following lines.
"containerOverrides": {
    "memory": 8192,
    "vcpus": 4
}
Change them to the following:
"containerOverrides": {
    "resourceRequirements": [
        { "type": "MEMORY", "value": "8192" },
        { "type": "VCPU", "value": "4" }
    ]
}
Make the same change to the memory and vCPU requirements that are specified in the containerProperties object in the job definition. For example, suppose that your memory and vCPU requirements are specified in the following lines.
"containerProperties": {
    "memory": 4096,
    "vcpus": 2
}
Change them to the following:
"containerProperties": {
    "resourceRequirements": [
        { "type": "MEMORY", "value": "4096" },
        { "type": "VCPU", "value": "2" }
    ]
}
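If you generate these structures in code, the conversion can be automated. The following sketch moves the deprecated memory and vcpus keys into a resourceRequirements list. The function name to_resource_requirements is an assumption for this example, and the values are converted to strings to match the examples above.

```python
def to_resource_requirements(overrides: dict) -> dict:
    """Move deprecated memory/vcpus keys into a resourceRequirements list."""
    # Copy every key except the deprecated ones.
    result = {k: v for k, v in overrides.items() if k not in ("memory", "vcpus")}
    requirements = list(result.get("resourceRequirements", []))
    if "memory" in overrides:
        requirements.append({"type": "MEMORY", "value": str(overrides["memory"])})
    if "vcpus" in overrides:
        requirements.append({"type": "VCPU", "value": str(overrides["vcpus"])})
    if requirements:
        result["resourceRequirements"] = requirements
    return result


converted = to_resource_requirements({"memory": 8192, "vcpus": 4})
print(converted)
```

The same helper works for a containerOverrides dictionary and for a containerProperties dictionary, because both use the same two deprecated keys.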
Error message when you update the desiredvCpus setting
You see the following error message when you use the AWS Batch API to update the desired vCPUs (desiredvCpus) setting.
Manually scaling down compute environment is not supported. Disconnecting job queues from compute environment will cause it to scale-down to minvCpus.
This issue occurs if the updated desiredvCpus value is less than the current desiredvCpus value. When you update the desiredvCpus value, both of the following must be true:
- The desiredvCpus value must be between the minvCpus and maxvCpus values.
- The updated desiredvCpus value must be greater than or equal to the current desiredvCpus value.
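The two rules can be checked before you send the update. The following is a minimal sketch. The function name check_desired_vcpus and the wording of the returned messages are made up for this example, and the bounds are treated as inclusive, which is an assumption.

```python
def check_desired_vcpus(new: int, current: int, min_vcpus: int, max_vcpus: int) -> list:
    """Return a list of rule violations; an empty list means the update is acceptable."""
    problems = []
    if not min_vcpus <= new <= max_vcpus:
        problems.append("desiredvCpus must be between minvCpus and maxvCpus")
    if new < current:
        problems.append("desiredvCpus can't be lower than the current value")
    return problems
```

For example, lowering desiredvCpus from 4 to 2 in an environment with minvCpus 0 and maxvCpus 16 returns one violation, which corresponds to the error message shown above.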
AWS Batch on Amazon EKS
INVALID compute environment
It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an INVALID state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.
Unsupported Kubernetes version
You might see an error message that resembles the following when you use the CreateComputeEnvironment API operation or UpdateComputeEnvironment API operation to create or update a compute environment. This issue occurs if you specify an unsupported Kubernetes version in EC2Configuration.
At least one imageKubernetesVersion in EC2Configuration is not supported.
To resolve this issue, delete the compute environment and then re-create it with a supported Kubernetes version.
You can perform a minor version upgrade on your Amazon EKS cluster. For example, you can upgrade the cluster from 1.xx to 1.yy even if the minor version isn't supported.
However, the compute environment status might change to INVALID after a major version update, for example, if you perform a major version upgrade from 1.xx to 2.yy. If the major version isn't supported by AWS Batch, you see an error message that resembles the following.
reason=CLIENT_ERROR - ... EKS Cluster version [2.yy] is unsupported
To resolve this issue, specify a supported Kubernetes version when you use an API operation to create or update a compute environment.
AWS Batch on Amazon EKS currently supports the following Kubernetes versions:
- 1.29
- 1.28
- 1.27
- 1.26
- 1.25
- 1.24
- 1.23
Instance profile doesn't exist
If the specified instance profile doesn't exist, the AWS Batch on Amazon EKS compute environment status is changed to INVALID. You see an error set in the statusReason parameter that resembles the following.
CLIENT_ERROR - Instance profile arn:aws:iam::...:instance-profile/<name> does not exist
To resolve this issue, specify or create a working instance profile. For more information, see Amazon EKS node IAM role in the Amazon EKS User Guide.
Invalid Kubernetes namespace
If AWS Batch on Amazon EKS can't validate the namespace for the compute environment, the compute environment status is changed to INVALID. For example, this issue can occur if the namespace doesn't exist.
You see an error message set in the statusReason parameter that resembles the following.
CLIENT_ERROR - Unable to validate Kubernetes Namespace
This issue can occur if any of the following are true:
- The Kubernetes namespace string in the CreateComputeEnvironment call doesn't exist. For more information, see CreateComputeEnvironment.
- The required Role-Based Access Control (RBAC) permissions to manage the namespace are not configured correctly.
- AWS Batch doesn't have access to the Amazon EKS Kubernetes API server endpoint.
To resolve this issue, see Verify that the aws-auth ConfigMap is configured correctly. For more information, see Getting started with AWS Batch on Amazon EKS.
Deleted compute environment
Suppose that you delete an Amazon EKS cluster before you delete the attached AWS Batch on Amazon EKS compute environment. Then, the compute environment status is changed to INVALID. In this scenario, the compute environment doesn't work properly if you re-create the Amazon EKS cluster with the same name.
To resolve this issue, delete and then re-create the AWS Batch on Amazon EKS compute environment.
Nodes don't join the Amazon EKS cluster
AWS Batch on Amazon EKS scales down a compute environment if it determines that not all nodes joined the Amazon EKS cluster. When AWS Batch on Amazon EKS scales down the compute environment, the compute environment status is changed to INVALID.
Note
AWS Batch doesn't change the compute environment status immediately so that you can debug the issue.
You see an error message set in the statusReason parameter that resembles one of the following:
Your compute environment has been INVALIDATED and scaled down because none of the
instances joined the underlying ECS Cluster. Common issues preventing instances joining are
the following: VPC/Subnet configuration preventing communication to ECS, incorrect Instance
Profile policy preventing authorization to ECS, or customized AMI or LaunchTemplate
configurations affecting ECS agent.
Your compute environment has been INVALIDATED and scaled down because none of the
nodes joined the underlying Amazon EKS Cluster. Common issues preventing nodes joining are the
following: networking configuration preventing communication to Amazon EKS Cluster, incorrect Amazon EKS
Instance Profile or Kubernetes RBAC policy preventing authorization to Amazon EKS Cluster, customized
AMI or LaunchTemplate configurations affecting Amazon EKS/Kubernetes node bootstrap.
When using a default Amazon EKS AMI, the most common causes of this issue are the following:
- The instance role isn't configured correctly. For more information, see Amazon EKS node IAM role in the Amazon EKS User Guide.
- The subnets aren't configured correctly. For more information, see Amazon EKS VPC and subnet requirements and considerations in the Amazon EKS User Guide.
- The security group isn't configured correctly. For more information, see Amazon EKS security group requirements and considerations in the Amazon EKS User Guide.
Note
You may also see an error notification in the Personal Health Dashboard (PHD).
AWS Batch on Amazon EKS job is stuck in RUNNABLE status
An aws-auth ConfigMap is automatically created and applied to your cluster when you create a managed node group or a node group using eksctl. An aws-auth ConfigMap is initially created to allow nodes to join your cluster. However, you also use the aws-auth ConfigMap to add role-based access control (RBAC) access to users and roles.
Verify that the aws-auth ConfigMap is configured correctly
To verify that the aws-auth ConfigMap is configured correctly:
- Retrieve the mapped roles in the aws-auth ConfigMap.
  $ kubectl get configmap -n kube-system aws-auth -o yaml
- Verify that the roleARN is configured as follows.
  rolearn: arn:aws:iam::aws_account_number:role/AWSServiceRoleForBatch
Note
The path aws-service-role/batch.amazonaws.com/ has been removed from the ARN of the service-linked role. This is because of an issue with the aws-auth configuration map. For more information, see Roles with paths do not work when the path is included in their ARN in the aws-auth configmap.
Note
You can also review the Amazon EKS control plane logs. For more information, see Amazon EKS control plane logging in the Amazon EKS User Guide.
To resolve an issue where a job is stuck in a RUNNABLE status, we recommend that you use kubectl to re-apply the manifest. For more information, see Step 1: Preparing your Amazon EKS cluster for AWS Batch. Or, you can use kubectl to manually edit the aws-auth ConfigMap. For more information, see Enabling IAM user and role access to your cluster in the Amazon EKS User Guide.
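Because the aws-auth ConfigMap expects the role ARN without its path, a quick check on the rolearn values can catch the mistake. The following sketch assumes that you have already extracted the rolearn strings from the kubectl output. The function names are hypothetical.

```python
import re


def strip_role_path(role_arn: str) -> str:
    """Drop any path between 'role/' and the role name in an IAM role ARN."""
    return re.sub(r":role/(?:.*/)?([^/]+)$", r":role/\1", role_arn)


def find_roles_with_paths(role_arns: list) -> list:
    """Return the rolearn values that still include a path."""
    return [arn for arn in role_arns if strip_role_path(arn) != arn]
```

Any ARN returned by find_roles_with_paths is a candidate for replacement with its strip_role_path form in the ConfigMap.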
RBAC permissions or bindings aren't configured properly
If you experience any RBAC permissions or binding issues, verify that the aws-batch Kubernetes role can access the Kubernetes namespace:
$ kubectl get namespace namespace --as=aws-batch
$ kubectl auth can-i get ns --as=aws-batch
You can also use the kubectl describe command to view the authorizations for a cluster role or Kubernetes namespace.
$ kubectl describe clusterrole aws-batch-cluster-role
The following is example output.
Name:         aws-batch-cluster-role
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                                      Non-Resource URLs  Resource Names  Verbs
  ---------                                      -----------------  --------------  -----
  configmaps                                     []                 []              [get list watch]
  nodes                                          []                 []              [get list watch]
  pods                                           []                 []              [get list watch]
  daemonsets.apps                                []                 []              [get list watch]
  deployments.apps                               []                 []              [get list watch]
  replicasets.apps                               []                 []              [get list watch]
  statefulsets.apps                              []                 []              [get list watch]
  clusterrolebindings.rbac.authorization.k8s.io  []                 []              [get list]
  clusterroles.rbac.authorization.k8s.io         []                 []              [get list]
  namespaces                                     []                 []              [get]
$ kubectl describe role aws-batch-compute-environment-role -n my-aws-batch-namespace
The following is example output.
Name:         aws-batch-compute-environment-role
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                               Non-Resource URLs  Resource Names  Verbs
  ---------                               -----------------  --------------  -----
  pods                                    []                 []              [create get list watch delete patch]
  serviceaccounts                         []                 []              [get list]
  rolebindings.rbac.authorization.k8s.io  []                 []              [get list]
  roles.rbac.authorization.k8s.io         []                 []              [get list]
To resolve this issue, re-apply the RBAC permissions and rolebinding commands. For more information, see Step 1: Preparing your Amazon EKS cluster for AWS Batch.