AWS ParallelCluster troubleshooting
The AWS ParallelCluster community maintains a Wiki with many troubleshooting tips; see the aws-parallelcluster wiki on GitHub.
Failure submitting AWS Batch multi-node parallel jobs
If you have problems submitting multi-node parallel jobs when using AWS Batch as the job scheduler, we recommend that you upgrade to AWS ParallelCluster version 2.5.0. If that is not feasible, you can use a workaround. For information about this workaround, see Self patch a cluster used for submitting multi node parallel jobs through AWS Batch.
Placement groups and instance launch issues
To get the lowest inter-node latency, we recommend that you use a placement group. A placement group guarantees that your instances are on the same networking backbone. If not enough instances are available when the request is made, an InsufficientInstanceCapacity error is returned. To reduce the possibility of receiving an InsufficientInstanceCapacity error when using cluster placement groups, set the placement_group parameter to DYNAMIC and the placement parameter to compute.
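A minimal sketch of these two settings in the cluster section of the configuration file, assuming ParallelCluster 2.x syntax and a placeholder section name, might look like the following:

    [cluster default]
    # Let AWS ParallelCluster create and manage a dynamic placement group
    placement_group = DYNAMIC
    # Apply the placement group to the compute fleet only
    placement = compute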
If you need a high-performance shared file system, consider using Amazon FSx for Lustre.
If the head node must be in the placement group, use the same instance type and subnet for the head node and all compute nodes. With this approach, the compute_instance_type parameter has the same value as the master_instance_type parameter, the placement parameter is set to cluster, and the compute_subnet_id parameter is not specified. With this configuration, the value of the master_subnet_id parameter is used for the compute nodes.
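A hedged sketch of such a configuration, again assuming ParallelCluster 2.x syntax and using placeholder instance types, section names, and IDs:

    [cluster default]
    # Head and compute nodes use the same instance type (placeholder value)
    master_instance_type = c5n.18xlarge
    compute_instance_type = c5n.18xlarge
    # Place the head node and the compute fleet in the same placement group
    placement_group = DYNAMIC
    placement = cluster
    vpc_settings = default

    [vpc default]
    # Placeholder IDs; compute_subnet_id is intentionally omitted,
    # so the compute nodes also use master_subnet_id
    vpc_id = vpc-0123456789abcdef0
    master_subnet_id = subnet-0123456789abcdef0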
For more information, see Troubleshooting instance launch issues and Placement group rules and limitations in the Amazon EC2 User Guide for Linux Instances.
Directories that cannot be replaced
The following directories are shared between the nodes and cannot be replaced.
/home - This includes the default user home folder (/home/ec2-user on Amazon Linux, /home/centos on CentOS, and /home/ubuntu on Ubuntu).
/opt/intel - This includes Intel MPI, Intel Parallel Studio, and related files.
/opt/sge - This includes Son of Grid Engine and related files. (Conditional, only if scheduler = sge.)
/opt/slurm - This includes Slurm Workload Manager and related files. (Conditional, only if scheduler = slurm.)
/opt/torque - This includes Torque Resource Manager and related files. (Conditional, only if scheduler = torque.)
NICE DCV troubleshooting
The logs for NICE DCV are written to files in the /var/log/dcv/ directory. Reviewing these logs can help you troubleshoot problems.
The instance type should have at least 1.7 GiB of RAM to run NICE DCV. Nano and micro instance types don't have enough memory to run NICE DCV.
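As a hedged illustration, the following ParallelCluster 2.x configuration sketch (the section names and instance type are placeholders) enables NICE DCV on a head node instance type that has enough memory:

    [cluster default]
    # t3.medium is a placeholder; avoid nano and micro instance types,
    # which have less memory than NICE DCV needs
    master_instance_type = t3.medium
    dcv_settings = my-dcv

    [dcv my-dcv]
    # Enable NICE DCV on the head (master) node
    enable = master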