Troubleshooting issues in clusters with AWS Batch integration
This section describes how to troubleshoot clusters that use the AWS Batch scheduler integration, covering head node issues, compute issues, job failures, and timeout errors.
Head node issues
You can troubleshoot head node setup issues the same way you would for a Slurm cluster (except for the Slurm-specific logs). For more information about these issues, see Head node.
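For example, you can connect to the head node and inspect the standard setup logs. This is a minimal sketch; the paths below are the usual AWS ParallelCluster log locations, and your cluster may have additional logs:

    # On the head node: review the node setup logs (typical AWS ParallelCluster paths).
    sudo tail -n 50 /var/log/cloud-init.log      # instance bootstrap
    sudo tail -n 50 /var/log/cfn-init.log        # CloudFormation init steps
    sudo tail -n 50 /var/log/chef-client.log     # node configuration recipes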
Compute issues
AWS Batch manages the scaling and compute aspects of your cluster. If you encounter compute-related issues, see the AWS Batch troubleshooting documentation for help.
Job failures
If a job fails, you can run the awsbout command to retrieve the job output. You can also run the awsbstat command to obtain a link to the job logs stored in Amazon CloudWatch Logs.
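For example (a minimal sketch; <job-id> is a placeholder for the ID that awsbstat reports for your failed job):

    # List jobs, including finished and failed ones, to find the job ID.
    awsbstat -s ALL
    # Show detailed status for that job, including the link to its CloudWatch log stream.
    awsbstat -d <job-id>
    # Retrieve the job output.
    awsbout <job-id>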
Connect timeout on endpoint URL error
If multi-node parallel jobs fail with the error Connect timeout on endpoint URL:
- In the awsbout output log, confirm that the job is multi-node parallel by checking for output similar to the following:
  Detected 3/3 compute nodes. Waiting for all compute nodes to start.
- Verify whether the compute node subnet is public. Multi-node parallel jobs don't support the use of public subnets when you use AWS Batch in AWS ParallelCluster. Use a private subnet for your compute nodes and jobs. For more information, see Compute environment considerations in the AWS Batch User Guide. To configure a private subnet for your compute nodes, see AWS ParallelCluster with AWS Batch scheduler and the configuration sketch after this list.
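As a sketch, the relevant part of a cluster configuration that places compute nodes in a private subnet might look like the following. The subnet IDs, queue name, and vCPU limits are placeholders, and the exact schema depends on your AWS ParallelCluster version; the head node can stay in a public subnet:

    HeadNode:
      InstanceType: t2.micro
      Networking:
        SubnetId: subnet-0example0public0      # head node subnet
    Scheduling:
      Scheduler: awsbatch
      AwsBatchQueues:
        - Name: batch-queue
          ComputeResources:
            - Name: batch-compute
              InstanceTypes:
                - optimal                      # let AWS Batch choose instance types
              MinvCpus: 0
              MaxvCpus: 64
          Networking:
            SubnetIds:
              - subnet-0example0private0       # private subnet for compute nodes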