Troubleshoot compute node bootstrap and registration problems in AWS PCS
When compute nodes fail to bootstrap or register properly with your AWS PCS cluster, you might experience the following symptoms:
- Jobs don't start
- You can't connect to instances in AWS Systems Manager
- Instances shut down unexpectedly
- Instances are continuously replaced
These failures can be caused by problems during EC2 instance launch or during the AWS PCS compute node bootstrap process. This topic describes procedures to help you troubleshoot problems during the AWS PCS node bootstrap process. For more information about troubleshooting EC2 instance launch, see Troubleshoot Amazon EC2 instance launch problems in the Amazon Elastic Compute Cloud User Guide.
Bootstrap failures occur when an EC2 instance launches successfully but fails during the process of joining the AWS PCS cluster. The bootstrap process includes two main phases:
- Node registration – The EC2 instance calls the RegisterComputeNodeGroupInstance AWS PCS API action to register with the AWS PCS service. Failures can occur due to problems with permissions or networking configuration.
- Slurm integration – The instance runs slurmd and joins the Slurm cluster. Failures can occur due to problems in the following:
  - Permissions
  - Custom AMI setup
How Slurm works on AWS PCS
It might help to compare the standard way Slurm works with the way Slurm works on AWS PCS.
Standard Slurm job processing
The following steps occur in standard Slurm job processing:
- When you submit a job, slurmctld validates and queues the job.
- When resources become available, slurmctld allocates existing nodes.
- slurmd daemons run jobs on allocated nodes.
Slurm job processing on AWS PCS
The following steps occur in AWS PCS job processing:
- When you submit a job, slurmctld validates and queues the job.
- When additional capacity is needed, AWS PCS uses the launch template for the compute node group to launch new EC2 instances.
- New instances bootstrap into the cluster:
  - Instances register with AWS PCS.
  - Instances join the Slurm cluster.
- When resources are ready, slurmctld allocates nodes (including newly bootstrapped ones).
- slurmd daemons run jobs on allocated nodes.
Retrieve instance logs
The first step in troubleshooting compute node bootstrap problems is to retrieve the instance logs, for example by viewing the instance's console output or by connecting to the instance and examining the logs directly.
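For example, you can pull the instance console output, which contains the user-data bootstrap messages shown throughout this topic, with the AWS CLI. This is a minimal sketch; the instance ID is a placeholder.

# Retrieve the latest console output for a compute node instance.
aws ec2 get-console-output \
    --instance-id i-0123456789abcdef0 \
    --latest \
    --output text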
Retrieve VPC/Subnet/Security Groups from an instance ID
To troubleshoot problems with your compute nodes, you might need to retrieve information about the VPC, subnet, and security groups associated with your instances. If you don't know your instance IDs, see Finding compute node group instances in AWS PCS.
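For example, the following sketch returns the VPC, subnet, and security group IDs for an instance; the instance ID is a placeholder.

# Look up the networking configuration for a compute node instance.
aws ec2 describe-instances \
    --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].{Vpc:VpcId,Subnet:SubnetId,SecurityGroups:SecurityGroups[].GroupId}' \
    --output json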
Node registration problems
Node registration is the first action executed by a compute node during bootstrap. The node calls the AWS PCS API endpoint to register itself with the AWS PCS Node Group Manager database. Registration failures usually show error messages similar to the following:
<13>Nov 5 08:10:27 user-data: Recipe: aws-pcs-environment::node_registration
<13>Nov 5 08:10:27 user-data: * ruby_block[Register NodeGroup Instance] action run
[2024-11-05T08:10:27+00:00] INFO: Processing ruby_block[Register NodeGroup Instance] action run (aws-pcs-environment::node_registration line 19)
<13>Nov 5 08:15:46 user-data:
<13>Nov 5 08:15:46 user-data:
<13>Nov 5 08:15:46 user-data: ================================================================================
<13>Nov 5 08:15:46 user-data: Error executing action `run` on resource 'ruby_block[Register NodeGroup Instance]'
<13>Nov 5 08:15:46 user-data: ================================================================================
<13>Nov 5 08:15:46 user-data:
<13>Nov 5 08:15:46 user-data: EOFError
Wrong instance profile
If the instance is unable to register, verify that the instance profile associated with the compute node has the pcs:RegisterComputeNodeGroupInstance permission.
For more information about how to create a valid instance profile, see Create an instance profile for AWS PCS.
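One way to verify the permission without connecting to the instance is to simulate it with IAM. This is a sketch; the account ID and role name are placeholders for the role in your instance profile.

# Check whether the node's role is allowed to call the registration action.
# An EvalDecision other than "allowed" in the output means the permission
# is missing or explicitly denied.
aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::111122223333:role/MyPCSNodeRole \
    --action-names pcs:RegisterComputeNodeGroupInstance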
Missing AWS PCS endpoints
Verify that your VPC has the necessary AWS PCS service endpoints configured. For more information about AWS PCS service endpoints, see Endpoints and service quotas for AWS PCS.
If your compute nodes are in a private subnet, ensure that you have configured VPC endpoints for AWS PCS or that your subnet has a route to a NAT gateway for internet access.
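To see which interface endpoints already exist in the cluster VPC, you can list them with the AWS CLI. This is a sketch; the VPC ID is a placeholder.

# List VPC endpoints in the cluster VPC with their service names and states.
aws ec2 describe-vpc-endpoints \
    --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
    --query 'VpcEndpoints[].{Service:ServiceName,State:State}' \
    --output table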
Misconfigured AWS PCS endpoint
If you see an error message similar to the following, verify the policy associated with your AWS PCS VPC endpoint:
com.amazon.coral.security.AccessDeniedException: User: arn:aws:sts::xxx:assumed-role/rolename/i-instanceid is not authorized to perform: pcs:RegisterComputeNodeGroupInstance on resource: arn:aws:pcs:us-west-2:xxx:cluster/cluster-id as either the resource does not exist, some policy explicitly denies access, or no policy grants access
For more information about how to configure VPC interface endpoints for AWS PCS, see Access AWS Parallel Computing Service using an interface endpoint (AWS PrivateLink).
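To review the policy attached to the endpoint, you can print its policy document. This is a sketch; the endpoint ID is a placeholder.

# Show the policy document attached to the AWS PCS VPC endpoint so you can
# confirm that it doesn't deny pcs:RegisterComputeNodeGroupInstance.
aws ec2 describe-vpc-endpoints \
    --vpc-endpoint-ids vpce-0123456789abcdef0 \
    --query 'VpcEndpoints[].PolicyDocument' \
    --output text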
Instance in a public subnet without public IP
If your subnet doesn't have auto-assign public IP enabled and your route configuration uses an internet gateway, instances can't communicate with the AWS PCS API.
Instances in a subnet that routes through an internet gateway must have a public IP address. To resolve this issue, choose one of the following options (a command to check the subnet setting follows the list):
- Add a VPC endpoint for AWS PCS to your cluster VPC. This enables instances to communicate with AWS PCS without needing a public IP address or a route through the internet gateway.
- Use a private subnet with a NAT gateway, so that a public IP address is not required.
- Enable automatic public IP address assignment through your subnet or launch template so that instances can contact the API through the internet gateway. Note that this option is not valid for instances with multiple network interfaces.
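To confirm whether the subnet assigns public IPv4 addresses automatically, you can query it with the AWS CLI. This is a minimal sketch; the subnet ID is a placeholder.

# A value of false means instances launched into this subnet don't get a
# public IPv4 address by default.
aws ec2 describe-subnets \
    --subnet-ids subnet-0123456789abcdef0 \
    --query 'Subnets[].MapPublicIpOnLaunch'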
Multi-NIC instance in a public subnet
You must use a private subnet if you use an instance type that has multiple network interfaces (NICs).
AWS public IP addresses can only be assigned to instances launched with a single network interface. For more information about IP addresses, see Assign a public IPv4 address during instance launch in the Amazon EC2 User Guide for Linux Instances.
Multi-NIC instance types require a NAT gateway or an internal proxy in the subnet to access the AWS PCS endpoint. Alternatively, you can add a VPC endpoint for AWS PCS to your cluster VPC.
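To check how many network cards an instance type has, you can query EC2. This is a sketch; the instance type shown is only an example.

# A MaximumNetworkCards value greater than 1 indicates a multi-NIC instance
# type that needs a private subnet with a NAT gateway, a proxy, or a VPC
# endpoint for AWS PCS.
aws ec2 describe-instance-types \
    --instance-types p5.48xlarge \
    --query 'InstanceTypes[].{Type:InstanceType,NetworkCards:NetworkInfo.MaximumNetworkCards}'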
Slurm cluster join problems
After successful node registration, the compute node attempts to join the Slurm cluster. The slurmd daemon on the node contacts the Slurm controller to register with the cluster. Slurm join failures usually show error messages similar to the following:
<13>Nov 5 17:20:29 user-data: [2024-11-05T17:20:28+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: service[slurmd] (aws-pcs-slurm::finalize_slurm line 18) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
<13>Nov 5 17:20:29 user-data: ---- Begin output of ["/usr/bin/systemctl", "--system", "start", "slurmd"] ----
<13>Nov 5 17:20:29 user-data: STDOUT:
<13>Nov 5 17:20:29 user-data: STDERR: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
<13>Nov 5 17:20:29 user-data: ---- End output of ["/usr/bin/systemctl", "--system", "start", "slurmd"] ----
Security group configuration
Verify that your security groups are configured correctly to allow communication between compute nodes and the Slurm controller. The security groups must allow the following traffic:
- Port 6817 for slurmd to communicate with slurmctld
- Port 6818 for slurmctld to ping slurmd
For more information about security group requirements, see Security group requirements and considerations.
Important
The cluster security group that you associated with your cluster during cluster creation must also be included in your compute node group security groups so that compute nodes can communicate with the controller.
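As a sketch of one common setup (the group ID is a placeholder), a single self-referencing security group attached to both the controller and the compute node groups allows Slurm traffic in both directions:

# Allow Slurm traffic on ports 6817-6818 between all members of the same
# security group.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 6817-6818 \
    --source-group sg-0123456789abcdef0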
Missing NVIDIA drivers
If the instance bootstraps correctly but jobs don't start, and you see error messages similar to the following in your instance logs, you might be missing NVIDIA drivers:
<13>Dec 2 13:52:00 user-data: [2024-12-02T13:52:00.094+00:00] - /opt/aws/pcs/bin/pcs_bootstrap_config_always.sh: INFO: nvidia-smi not found!
...
<13>Dec 2 13:54:10 user-data: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
<13>Dec 2 13:54:12 user-data: [2024-12-02T13:54:12.718+00:00] - /opt/aws/pcs/bin/pcs_bootstrap_finalize.sh: INFO: systemctl could not start slurmd!
If you connect to the instance and check the slurmd daemon status, you might see an error similar to the following:
$ systemctl status slurmd
...
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
To resolve this issue, install NVIDIA drivers on your custom AMI. For more information, see Step 4 – (Optional) Install additional drivers, libraries, and application software.
ResumeTimeout reached
If a compute node and its EC2 instance are terminated because the node is unhealthy, the cause might be an AMI that AWS PCS doesn't support or network problems. The EC2 instance runs for approximately 30 minutes until Slurm's ResumeTimeout is reached and Slurm marks the node as DOWN.
If the instance doesn't bootstrap correctly and isn't registered with AWS PCS (there is no RegisterComputeNodeGroupInstance call for the EC2 instance), check your instance logs for error messages similar to the following:
/opt/aws/pcs/bin/pcs_bootstrap_init.sh: No such file or directory
This error indicates that the AWS PCS bootstrap software is not part of the AMI. To resolve this issue, ensure that your custom AMI includes the AWS PCS bootstrap software. For more information, see Custom Amazon Machine Images (AMIs) for AWS PCS.
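One way to confirm whether the registration call was ever made is to search CloudTrail event history. This is a sketch; it assumes CloudTrail is recording management events in the Region.

# Look for recent RegisterComputeNodeGroupInstance calls. An instance that
# never registered won't appear in the results.
aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=RegisterComputeNodeGroupInstance \
    --max-results 10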
Slurmctld unable to ping compute node
If the instance correctly executes the bootstrap procedure and is registered with AWS PCS, but slurmctld is unable to see it and submit jobs to it, the instance is set to DOWN after some time and then terminated.
This might be caused by misconfigured security groups. For example, port 6817 might be open so that slurmd can communicate with slurmctld, while port 6818 is missing so that slurmctld can't ping slurmd.
Verify that your security groups include all required rules as documented in Security group requirements and considerations.
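To review the rules on a node group security group, you can list its ingress permissions. This is a sketch; the group ID is a placeholder.

# Confirm that both port 6817 and port 6818 appear in the ingress rules.
aws ec2 describe-security-groups \
    --group-ids sg-0123456789abcdef0 \
    --query 'SecurityGroups[].IpPermissions[].{Protocol:IpProtocol,From:FromPort,To:ToPort}'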