AWS ParallelCluster Troubleshooting

The AWS ParallelCluster community maintains many troubleshooting tips on the aws-parallelcluster GitHub wiki.

Failure submitting AWS Batch multi-node parallel jobs

If you have problems submitting multi-node parallel jobs when using AWS Batch as the job scheduler, we recommend that you upgrade to AWS ParallelCluster 2.5.0. If that is not feasible, you can use a workaround. For information, see Self patch a Cluster Used for Submitting Multi node Parallel Jobs through AWS Batch.
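To confirm that the failure is specific to multi-node parallel submission, you can submit a small test job with the awsbsub command that AWS ParallelCluster installs for the AWS Batch scheduler. The following is a minimal sketch; the node count and the script name (my_mpi_job.sh) are hypothetical.

    # Sketch: submit a hypothetical 4-node multi-node parallel test job.
    # -n sets the number of nodes; -cf indicates that the command is a
    # script file to be transferred to the compute instances.
    awsbsub -n 4 -cf my_mpi_job.sh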

Placement Groups and Instance Launch Issues

To get the lowest inter-node latency, we recommend that you use a placement group. A placement group guarantees that your instances are on the same networking backbone. If not enough instances are available when the request is made, an InsufficientInstanceCapacity error is returned. To reduce the possibility of receiving an InsufficientInstanceCapacity error when using cluster placement groups, set the placement_group parameter to DYNAMIC and set the placement parameter to compute.
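As a sketch, these two settings go in the cluster section of the AWS ParallelCluster configuration file (the section name default is an assumption):

    [cluster default]
    # Let AWS ParallelCluster create and manage the placement group,
    # and place only the compute nodes in it.
    placement_group = DYNAMIC
    placement = compute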

If a high performance shared filesystem is needed, consider using Amazon FSx for Lustre.
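For example, a minimal sketch of attaching an Amazon FSx for Lustre file system through the configuration file (the section name myfsx, the mount point, and the capacity are assumptions):

    [cluster default]
    fsx_settings = myfsx

    [fsx myfsx]
    # Mount point on the nodes and file system size in GiB.
    shared_dir = /fsx
    storage_capacity = 1200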

If the master node must be in the placement group, use the same instance type and subnet for both the master and compute nodes. In other words, the compute_instance_type parameter has the same value as the master_instance_type parameter, the placement parameter is set to cluster, and the compute_subnet_id parameter is not specified. This means that the value of the master_subnet_id parameter is used for the compute nodes.
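A minimal sketch of such a configuration follows; the instance type and subnet ID are placeholders, and placement_group is set to DYNAMIC per the earlier recommendation:

    [cluster default]
    # Master and compute nodes use the same instance type, and the
    # whole cluster is placed in one placement group.
    master_instance_type = c5n.18xlarge
    compute_instance_type = c5n.18xlarge
    master_subnet_id = subnet-xxxxxxxx
    placement_group = DYNAMIC
    placement = cluster
    # compute_subnet_id is intentionally not set, so the compute nodes
    # launch in master_subnet_id alongside the master node.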

For more information, see Troubleshooting instance launch issues and Placement groups rules and limitations in the Amazon EC2 User Guide for Linux Instances.

Directories that cannot be replaced

The following directories are shared between the nodes and cannot be replaced. A quick way to verify that they are shared appears after the list.

/home

This includes the default user home folder (/home/ec2-user on Amazon Linux, /home/centos on CentOS, and /home/ubuntu on Ubuntu).

/opt/intel

This includes Intel MPI, Intel Parallel Studio, and related files.

/opt/sge

This includes Son of Grid Engine and related files. (Conditional, only if scheduler = sge.)

/opt/slurm

This includes Slurm Workload Manager and related files. (Conditional, only if scheduler = slurm.)

/opt/torque

This includes Torque Resource Manager and related files. (Conditional, only if scheduler = torque.)
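These directories are exported from the master node over NFS. As a quick, hedged check, you can list the NFS mounts on a compute node to confirm which paths come from the master node:

    # On a compute node, list the directories mounted from the master node.
    mount | grep nfs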

NICE DCV troubleshooting

The logs for NICE DCV are written to files in the /var/log/dcv/ directory. Reviewing these logs can help you troubleshoot problems.
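For example, a quick sketch of inspecting the most recent server log entries (server.log is the usual main log file name, but exact file names can vary by NICE DCV version):

    # Show the last 50 lines of the NICE DCV server log on the master node.
    sudo tail -n 50 /var/log/dcv/server.log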

The instance type should have at least 1.7 GiB of RAM to run NICE DCV. Nano and micro instance types do not have enough memory to run NICE DCV.