AWS ParallelCluster processes
This section applies to clusters that are deployed with Slurm. When used with this scheduler, AWS ParallelCluster manages compute node provisioning and removal by interacting with the underlying job scheduler.
For HPC clusters that are based on AWS Batch, AWS ParallelCluster relies on the capabilities provided by AWS Batch for compute node management.
clustermgtd
The cluster management daemon performs the following tasks (a sketch of its management loop follows the list):
- Inactive partition clean-up
- Management of Slurm reservations and nodes associated with Capacity Blocks (see the following section)
- Static capacity management: ensure that static capacity is always up and healthy
- Synchronization of the scheduler with Amazon EC2
- Orphaned instance clean-up
- Restoration of the scheduler node status when an Amazon EC2 termination happens outside of the suspend workflow
- Management of unhealthy Amazon EC2 instances (instances failing Amazon EC2 health checks)
- Management of scheduled maintenance events
- Management of unhealthy scheduler nodes (nodes failing scheduler health checks)
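Conceptually, clustermgtd runs these tasks in a periodic loop. The following is a minimal sketch of such a loop, assuming illustrative task names and a 60-second cadence; it is not the daemon's actual implementation.

```python
import logging
import time

# Hypothetical task functions -- the names mirror the list above but are
# illustrative, not clustermgtd's real internals.
def clean_up_inactive_partitions(): ...
def manage_capacity_block_reservations(): ...
def maintain_static_capacity(): ...
def sync_scheduler_with_ec2(): ...
def clean_up_orphaned_instances(): ...

LOOP_INTERVAL_SECONDS = 60  # assumed cadence; configurable in practice

def management_loop():
    """Run each maintenance task once per iteration, tolerating failures."""
    tasks = [
        clean_up_inactive_partitions,
        manage_capacity_block_reservations,
        maintain_static_capacity,
        sync_scheduler_with_ec2,
        clean_up_orphaned_instances,
    ]
    while True:
        for task in tasks:
            try:
                task()
            except Exception:  # one failing task must not stop the loop
                logging.exception("task %s failed", task.__name__)
        time.sleep(LOOP_INTERVAL_SECONDS)
```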
Management of Slurm reservations and nodes associated with Capacity Blocks
ParallelCluster supports On-Demand Capacity Reservations (ODCR) and Capacity Blocks for Machine Learning (CB). Unlike an ODCR, a CB can have a future start time and is time-bound.
Clustermgtd searches for unhealthy nodes in a loop and terminates any Amazon EC2 instances that are down, replacing them with new instances if they are static nodes.
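To make the health-check half of that loop concrete, here is a minimal boto3 sketch. It assumes the caller already knows which instance IDs back the cluster's nodes, and it omits launching replacement instances for static nodes.

```python
import boto3

ec2 = boto3.client("ec2")

def terminate_unhealthy_instances(instance_ids):
    """Terminate instances that fail either EC2 status check."""
    statuses = ec2.describe_instance_status(
        InstanceIds=instance_ids, IncludeAllInstances=True
    )["InstanceStatuses"]
    unhealthy = [
        s["InstanceId"]
        for s in statuses
        if s["InstanceStatus"]["Status"] == "impaired"
        or s["SystemStatus"]["Status"] == "impaired"
    ]
    if unhealthy:
        ec2.terminate_instances(InstanceIds=unhealthy)
    return unhealthy
```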
ParallelCluster manages static nodes associated with Capacity Blocks differently: it creates the cluster even if the CB is not yet active, and it launches instances automatically once the CB becomes active.
The Slurm nodes corresponding to compute resources associated with CBs that are not yet active are kept in maintenance until the CB start time is reached. These nodes remain in a reservation/maintenance state associated with the Slurm admin user, which means they can accept jobs, but the jobs remain pending until the Slurm reservation is removed.
Clustermgtd automatically creates and deletes these Slurm reservations, putting the related CB nodes in maintenance according to the CB state. When the CB becomes active, the Slurm reservation is removed, and the nodes start and become available for pending jobs or new job submissions.
When the CB end time is reached, the nodes are moved back to a reservation/maintenance state. It is up to users to resubmit or requeue their jobs to a different queue or compute resource once the CB is no longer active and its instances are terminated.
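The reservation lifecycle described above can be illustrated with standard Slurm scontrol commands. This minimal Python sketch wraps them with an assumed reservation name and node list; it mirrors the behavior clustermgtd automates, not the daemon's actual code.

```python
import subprocess

def hold_cb_nodes(reservation_name: str, node_list: str) -> None:
    """Park CB nodes in a maintenance reservation owned by the Slurm admin
    user, so jobs can be submitted but stay pending."""
    subprocess.run(
        [
            "scontrol", "create", "reservation",
            f"ReservationName={reservation_name}",
            "StartTime=now",
            "Duration=infinite",
            "Flags=maint",
            "Users=slurm",
            f"Nodes={node_list}",
        ],
        check=True,
    )

def release_cb_nodes(reservation_name: str) -> None:
    """Delete the maintenance reservation once the CB becomes active."""
    subprocess.run(
        ["scontrol", "delete", f"ReservationName={reservation_name}"],
        check=True,
    )

# Example (hypothetical node names):
# hold_cb_nodes("cb-queue1-cr1", "queue1-st-cr1-[1-4]")
```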
clusterstatusmgtd
The cluster status management daemon (clusterstatusmgtd) manages compute fleet status updates. Every minute, it fetches the fleet status stored in a DynamoDB table and processes any STOP/START request.
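As an illustration of that polling step, here is a minimal boto3 sketch; the table name, key, and attribute layout are assumptions for illustration, not the daemon's documented schema.

```python
import boto3

TABLE_NAME = "parallelcluster-mycluster"  # assumed; the real table is cluster-specific
STATUS_KEY = {"Id": "COMPUTE_FLEET"}      # assumed key shape

def fetch_fleet_status():
    """Read the compute fleet status item from DynamoDB."""
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    item = table.get_item(Key=STATUS_KEY).get("Item", {})
    return item.get("Status")  # e.g. a STOP/START request to act on
```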
computemgtd
Compute management daemon (computemgtd) processes run on each of the cluster compute nodes.
Every five (5) minutes, the compute management daemon confirms that the head node can be reached and is healthy.
If five (5) minutes pass during which the head node cannot be reached or is not healthy, the compute node is shut down.
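The watchdog behavior might look like the following minimal sketch; the hostname, port, and TCP probe are illustrative assumptions (computemgtd's actual health check is internal to ParallelCluster).

```python
import socket
import subprocess
import time

HEAD_NODE = "head-node-hostname"  # assumed; resolved from cluster config in practice
SLURMCTLD_PORT = 6817             # Slurm's default slurmctld port
CHECK_INTERVAL_SECONDS = 300      # five minutes, matching the behavior above

def head_node_reachable() -> bool:
    """A TCP connect as an illustrative stand-in for the real health check."""
    try:
        with socket.create_connection((HEAD_NODE, SLURMCTLD_PORT), timeout=10):
            return True
    except OSError:
        return False

def watchdog() -> None:
    while True:
        time.sleep(CHECK_INTERVAL_SECONDS)
        if not head_node_reachable():
            # Self-terminate this compute node when the head node is unreachable.
            subprocess.run(["shutdown", "-h", "now"], check=False)
            return
```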