AWS ParallelCluster User Guide

AWS ParallelCluster Troubleshooting

The AWS ParallelCluster community maintains many troubleshooting tips on the aws-parallelcluster wiki.

Failure submitting AWS Batch multi-node parallel jobs

If you have problems submitting multi-node parallel jobs when using AWS Batch as the job scheduler, we recommend that you upgrade to AWS ParallelCluster 2.5.0. If that is not feasible, you can use the workaround described in Self patch a Cluster Used for Submitting Multi node Parallel Jobs through AWS Batch.

Placement Groups and Instance Launch Issues

To get the lowest inter-node latency, AWS recommends that you use a placement group. A placement group guarantees that your instances are on the same networking backbone. If not enough instances are available when the request is made, an InsufficientInstanceCapacity error is returned. To reduce the chance of receiving an InsufficientInstanceCapacity error when using cluster placement groups, set the placement_group parameter to DYNAMIC and set the placement parameter to compute.
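As a minimal sketch, the two settings above might appear in the cluster section of the AWS ParallelCluster configuration file as follows. The section name "default" is an assumption for illustration; use the name of your own cluster section.

```ini
# Sketch only: the section name "default" is an assumed placeholder.
[cluster default]
# Let AWS ParallelCluster create and delete the placement group with the cluster.
placement_group = DYNAMIC
# Place only the compute nodes in the placement group.
placement = compute
```

With placement set to compute, the master node is launched outside the placement group, so only the compute fleet has to fit within the available capacity.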

If a high performance shared filesystem is needed, consider using Amazon FSx for Lustre.

If the master node must be in the placement group, use the same instance type and subnet for both the master and compute nodes. In other words, the compute_instance_type parameter has the same value as the master_instance_type parameter, the placement parameter is set to cluster, and the compute_subnet_id parameter is not specified (the value of the master_subnet_id parameter is used for the compute nodes).
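The configuration described above could look roughly like the following sketch. The section names, the instance type c5.xlarge, and the subnet ID are hypothetical placeholders, not values taken from this guide; substitute your own.

```ini
# Sketch only: section names, instance type, and subnet ID are placeholders.
[cluster default]
vpc_settings = default
placement_group = DYNAMIC
# Put both the master and compute nodes in the placement group.
placement = cluster
# Use the same instance type for the master and compute nodes.
master_instance_type = c5.xlarge
compute_instance_type = c5.xlarge

[vpc default]
# compute_subnet_id is intentionally omitted, so the compute nodes
# use the subnet given by master_subnet_id.
master_subnet_id = subnet-0123456789abcdef0
```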

For more information, see Troubleshooting Instance Launch issues and Placement Groups Roles and Limitations in the Amazon EC2 User Guide for Linux Instances.