Troubleshooting Slurm reboot issues in AWS PCS

When a node fails to reboot cleanly, first check its status with scontrol show node nodename, replacing nodename with the name of the affected node. Then examine the CloudWatch logs for Slurm (slurmctld and slurmd) and the system logs to identify errors.
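The status check above can be scripted. The following is a minimal sketch, not an official AWS PCS tool. It assumes the default multi-line output of scontrol show node, in which State and Reason appear as key=value fields. The function names parse_node_status and fetch_node_status are hypothetical.

```python
import re
import subprocess


def parse_node_status(scontrol_output: str) -> dict:
    """Extract NodeName, State, and Reason from `scontrol show node` text.

    Assumes the default multi-line layout, where Reason (if present)
    occupies the rest of its own line. Missing fields come back as None
    (or an empty list for the state flags).
    """
    name = re.search(r"\bNodeName=(\S+)", scontrol_output)
    state = re.search(r"\bState=(\S+)", scontrol_output)
    reason = re.search(r"^\s*Reason=(.*)$", scontrol_output, re.MULTILINE)
    return {
        "name": name.group(1) if name else None,
        # Slurm joins the base state and any flags with "+".
        "state_flags": state.group(1).split("+") if state else [],
        "reason": reason.group(1).strip() if reason else None,
    }


def fetch_node_status(node_name: str) -> dict:
    """Run scontrol on a host with Slurm client tools and parse the result."""
    result = subprocess.run(
        ["scontrol", "show", "node", node_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_node_status(result.stdout)
```

For example, fetch_node_status("compute-1") (a hypothetical node name) returns a dictionary whose state_flags and reason entries can be copied into a support case along with the log excerpts.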

For basic troubleshooting, verify network connectivity, check security group settings, and confirm that all required services are running after the reboot. If problems persist, contact AWS Support. Include relevant log excerpts, node status information, and a timeline of the reboot attempt to help expedite resolution.
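One way to test the network connectivity mentioned above is a plain TCP connection attempt from the rebooted node to the Slurm controller. The sketch below rests on assumptions: the helper name can_connect is hypothetical, and port 6817 is the Slurm default for slurmctld, which may not match your cluster. Use the endpoint address and port reported for your own cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result means the connection was refused, timed out, or the
    host could not be resolved. It does not say which layer is at fault
    (security group, routing, or a stopped service).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed value: 6817 is the Slurm default SlurmctldPort.
DEFAULT_SLURMCTLD_PORT = 6817
```

For example, can_connect("10.0.0.10", DEFAULT_SLURMCTLD_PORT) checks a hypothetical controller address. If the check fails while the service is known to be running, review the security group rules next.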

Additional resources

For monitoring AWS PCS instances using CloudWatch, see Monitoring AWS PCS instances using Amazon CloudWatch.
For general troubleshooting, see Troubleshooting problems in AWS Parallel Computing Service.
For Slurm documentation, see Slurm Troubleshooting Guide.
