
AWS ParallelCluster troubleshooting

The AWS ParallelCluster community maintains a Wiki with many troubleshooting tips at the aws-parallelcluster wiki.

Failure submitting AWS Batch multi-node parallel jobs

If you have problems submitting multi-node parallel jobs when using AWS Batch as the job scheduler, we recommend that you upgrade to AWS ParallelCluster version 2.5.0. If that is not feasible, you can use a workaround. For information about this workaround, see Self patch a cluster used for submitting multi node parallel jobs through AWS Batch.

Placement groups and instance launch issues

To get the lowest inter-node latency, we recommend that you use a placement group. A placement group guarantees that your instances are on the same networking backbone. If not enough instances are available when the request is made, an InsufficientInstanceCapacity error is returned. To reduce the possibility of receiving an InsufficientInstanceCapacity error when using cluster placement groups, set the placement_group parameter to DYNAMIC and set the placement parameter to compute.
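As a sketch, the two settings above go in the cluster section of the AWS ParallelCluster (v2) configuration file; the section name `default` is an assumption for illustration:

```ini
[cluster default]
# Let ParallelCluster create and manage the placement group
placement_group = DYNAMIC
# Apply the placement group to the compute fleet only
placement = compute
```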

If you need a high performance shared filesystem, consider using Amazon FSx for Lustre.
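A hedged sketch of attaching an Amazon FSx for Lustre file system in the v2 configuration; the section label `myfsx`, mount point, and capacity value are illustrative assumptions:

```ini
[cluster default]
fsx_settings = myfsx

[fsx myfsx]
# Mount point for the Lustre file system on all nodes (assumed path)
shared_dir = /fsx
# Capacity in GiB; 1200 is the minimum increment (example value)
storage_capacity = 1200
```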

If the head node must be in the placement group, use the same instance type and subnet for the head node and all compute nodes. To do this, set the compute_instance_type parameter to the same value as the master_instance_type parameter, set the placement parameter to cluster, and omit the compute_subnet_id parameter. With this configuration, the value of the master_subnet_id parameter is used for the compute nodes.
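This configuration can be sketched as follows; the instance type and subnet ID are placeholder values, not recommendations:

```ini
[cluster default]
placement_group = DYNAMIC
# Put the head node and compute nodes in the same placement group
placement = cluster
# Head and compute nodes must use the same instance type (example type)
master_instance_type = c5n.18xlarge
compute_instance_type = c5n.18xlarge
# compute_subnet_id is intentionally omitted, so compute nodes
# launch into the head node's subnet (placeholder subnet ID)
master_subnet_id = subnet-12345678
```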

For more information, see Troubleshooting instance launch issues and Placement group rules and limitations in the Amazon EC2 User Guide for Linux Instances.

Directories that cannot be replaced

The following directories are shared between the nodes and cannot be replaced.

/home
This includes the default user home folder (/home/ec2-user on Amazon Linux, /home/centos on CentOS, and /home/ubuntu on Ubuntu).

/opt/intel
This includes Intel MPI, Intel Parallel Studio, and related files.

/opt/sge
This includes Son of Grid Engine and related files. (Conditional, only if scheduler = sge.)

/opt/slurm
This includes Slurm Workload Manager and related files. (Conditional, only if scheduler = slurm.)

/opt/torque
This includes Torque Resource Manager and related files. (Conditional, only if scheduler = torque.)

NICE DCV troubleshooting

The logs for NICE DCV are written to files in the /var/log/dcv/ directory. Reviewing these logs can help you troubleshoot problems.

The instance type must have at least 1.7 GiB of RAM to run NICE DCV; nano and micro instance types don't have enough memory.