Slurm cluster fast insufficient capacity fail-over
Starting with AWS ParallelCluster version 3.2.0, clusters run with fast insufficient capacity fail-over mode enabled by default. This minimizes the time spent retrying a job when Amazon EC2 insufficient capacity errors are detected. It's particularly effective when you configure your cluster with multiple instance types.
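For example, a queue with more than one compute resource gives fail-over a place to requeue jobs. The following is a minimal sketch of such a cluster configuration; the queue and compute resource names match the examples later on this page, but the instance types and subnet ID are illustrative placeholders:

```
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0  # placeholder subnet
      ComputeResources:
        - Name: c-1
          InstanceType: c5.xlarge     # placeholder instance type
          MinCount: 0
          MaxCount: 15
        - Name: c-2
          InstanceType: c5a.xlarge    # placeholder alternative instance type
          MinCount: 0
          MaxCount: 15
```

With two compute resources in `queue1`, a job that hits an insufficient capacity error on `c-1` can be requeued to `c-2`.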
Amazon EC2 detected insufficient capacity failures:
- `InsufficientInstanceCapacity`
- `InsufficientHostCapacity`
- `InsufficientReservedInstanceCapacity`
- `MaxSpotInstanceCountExceeded`
- `SpotMaxPriceTooLow`: Activated if the Spot request price is lower than the minimum required Spot request fulfillment price.
- `Unsupported`: Activated with the use of an instance type that isn't supported in a specific AWS Region.
In fast insufficient capacity fail-over mode, if an insufficient capacity error is detected when a job is assigned to a SlurmQueues compute resource, AWS ParallelCluster does the following:
- It sets the compute resource to a disabled (`DOWN`) state for a predefined period of time.
- It uses `POWER_DOWN_FORCE` to cancel the failing node's jobs and to suspend the failing node. It sets the failing node to the `IDLE` and `POWER_DOWN (!)` state, and then to `POWERING_DOWN (%)`.
- It requeues the job to another compute resource.
The static and powered up nodes of the disabled compute resource aren't impacted. Jobs can complete on these nodes.
This cycle repeats until the job is successfully assigned to a compute resource node or nodes. For information about node states, see the Slurm guide for multiple queue mode.
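For illustration, the suspend step corresponds to the following manual Slurm operation. This is a sketch of what ParallelCluster issues automatically; the node name is hypothetical, and you don't normally run this yourself:

```
# POWER_DOWN_FORCE cancels the node's jobs and powers the node down
# immediately, without waiting for the configured suspend timeout.
$ scontrol update NodeName=queue1-dy-c-1-1 State=POWER_DOWN_FORCE
```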
If no compute resources are found to run the job, the job is set to the `PENDING` state until the predefined period of time elapses. In this case, you can modify the predefined period of time as described in the following section.
Insufficient capacity timeout parameter

`insufficient_capacity_timeout`

`insufficient_capacity_timeout` specifies the period of time (in seconds) that the compute resource is kept in the disabled (`down`) state when an insufficient capacity error is detected.
By default, `insufficient_capacity_timeout` is enabled. The default `insufficient_capacity_timeout` is 600 seconds (10 minutes). If the `insufficient_capacity_timeout` value is less than or equal to zero, fast insufficient capacity fail-over mode is disabled.
You can change the `insufficient_capacity_timeout` value by adding the parameter to the `clustermgtd` config file located at `/etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf` on the HeadNode. The parameter can be updated at any time without stopping the compute fleet.
For example:
- `insufficient_capacity_timeout=600`: If an insufficient capacity error is detected, the compute resource is set to the disabled (`DOWN`) state. After 10 minutes, its failed node is set to the `idle~` (`POWER_SAVING`) state.
- `insufficient_capacity_timeout=60`: If an insufficient capacity error is detected, the compute resource is set to the disabled (`DOWN`) state. After 1 minute, its failed node is set to the `idle~` state.
- `insufficient_capacity_timeout=0`: Fast insufficient capacity fail-over mode is disabled. The compute resource isn't disabled.
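A minimal sketch of changing the value on the head node follows. It assumes shell access as root and that the parameter appears as a `key = value` line in the file; check the file's contents before editing:

```
# Set the timeout to 3 minutes in the clustermgtd config file.
$ sudo sed -i 's/^insufficient_capacity_timeout.*/insufficient_capacity_timeout = 180/' \
    /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf

# If the key isn't present yet, append it instead.
$ echo 'insufficient_capacity_timeout = 180' | sudo tee -a \
    /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
```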
Note
There might be a delay of up to one minute between the time when nodes fail with insufficient capacity errors and the time when the cluster management daemon detects the node failures. This is because the cluster management daemon checks for node insufficient capacity failures and sets the compute resources to the `down` state at one-minute intervals.
Fast insufficient capacity fail-over mode status
When a cluster is in fast insufficient capacity fail-over mode, you can check its status and node states.
Node states
When a job is submitted to a compute resource dynamic node and an insufficient capacity error is detected, the node is placed in the `down#` state with the reason:

```
(Code:InsufficientInstanceCapacity)Failure when resuming nodes
```

Then, powered-off nodes (nodes in the `idle~` state) are set to `down~` with the reason:

```
(Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
```

The job is requeued to other compute resources in the queue.
The compute resource static nodes and nodes that are `UP` aren't impacted by fast insufficient capacity fail-over mode.
Consider the node states shown in the following example.
```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue1*      up   infinite     30  idle~ queue1-dy-c-1-[1-15],queue1-dy-c-2-[1-15]
queue2       up   infinite     30  idle~ queue2-dy-c-1-[1-15],queue2-dy-c-2-[1-15]
```
We submit a job to queue1 that requires one node.
```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue1*      up   infinite      1  down# queue1-dy-c-1-1
queue1*      up   infinite     15  idle~ queue1-dy-c-2-[1-15]
queue1*      up   infinite     14  down~ queue1-dy-c-1-[2-15]
queue2       up   infinite     30  idle~ queue2-dy-c-1-[1-15],queue2-dy-c-2-[1-15]
```
Node `queue1-dy-c-1-1` is launched to run the job. However, the instance fails to launch because of an insufficient capacity error. Node `queue1-dy-c-1-1` is set to `down#`. The powered-off dynamic nodes within the compute resource (`queue1-dy-c-1`) are set to `down~`.
You can check the node reason with `scontrol show nodes`.
```
$ scontrol show nodes queue1-dy-c-1-1
NodeName=queue1-dy-c-1-1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=96 CPULoad=0.00
...
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=(Code:InsufficientInstanceCapacity)Failure when resuming nodes [root@2022-03-10T22:17:50]
```
```
$ scontrol show nodes queue1-dy-c-1-2
NodeName=queue1-dy-c-1-2 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=96 CPULoad=0.00
...
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=(Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity [root@2022-03-10T22:17:50]
```
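To see all failed nodes and their reasons in one view, standard Slurm's `sinfo -R` can also be used. This isn't specific to ParallelCluster, and the output below is an abbreviated sketch consistent with the node states above:

```
$ sinfo -R
REASON                                          USER   TIMESTAMP            NODELIST
(Code:InsufficientInstanceCapacity)Failure ...  root   2022-03-10T22:17:50  queue1-dy-c-1-1
(Code:InsufficientInstanceCapacity)Tempora ...  root   2022-03-10T22:17:50  queue1-dy-c-1-[2-15]
```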
The job is requeued to another instance type within the queue's compute resources.
After the `insufficient_capacity_timeout` elapses, nodes in the compute resource are reset to the `idle~` state.
```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue1*      up   infinite     30  idle~ queue1-dy-c-1-[1-15],queue1-dy-c-2-[1-15]
queue2       up   infinite     30  idle~ queue2-dy-c-1-[1-15],queue2-dy-c-2-[1-15]
```
After the `insufficient_capacity_timeout` elapses and nodes in the compute resource are reset to the `idle~` state, the Slurm scheduler gives the nodes lower priority. The scheduler keeps selecting nodes from other queue compute resources with higher weights unless one of the following occurs:
- A job's submission requirements match the recovered compute resource.
- No other compute resources are available because they are at capacity.
- `slurmctld` is restarted.
- The AWS ParallelCluster compute fleet is stopped and started to power down and power up all nodes, as shown in the sketch after this list.
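A sketch of that last option with the ParallelCluster CLI; the cluster name `mycluster` is a placeholder:

```
# Stop the compute fleet (powers down all nodes) ...
$ pcluster update-compute-fleet --cluster-name mycluster --status STOP_REQUESTED

# ... wait until the fleet reports STOPPED ...
$ pcluster describe-compute-fleet --cluster-name mycluster

# ... then start it again (powers nodes back up).
$ pcluster update-compute-fleet --cluster-name mycluster --status START_REQUESTED
```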
Related logs
Logs related to insufficient capacity errors and fast insufficient capacity fail-over mode can be found in Slurm's `resume` log and the `clustermgtd` log on the head node.
- Slurm `resume` log (`/var/log/parallelcluster/slurm_resume.log`): Error messages when a node fails to launch because of insufficient capacity.

```
[slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: dcd0c252-90d4-44a7-9c79-ef740f7ecd87
[slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['queue1-dy-c-1-1']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c.
```
- Slurm `clustermgtd` log (`/var/log/parallelcluster/clustermgtd`): Compute resource c-1 in queue1 is disabled because of insufficient capacity.

```
[slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'queue1': {'c-1': ComputeResourceFailureEvent(timestamp=datetime.datetime(2022, 4, 14, 23, 0, 4, 769380, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources are reset after insufficient capacity timeout (600 seconds) expired
```
After the insufficient capacity timeout expires, the compute resource is reset and the nodes within the compute resource are set to `idle~`.

```
[root:_reset_insufficient_capacity_timeout_expired_nodes] - INFO - Reset the following compute resources because insufficient capacity timeout expired: {'queue1': ['c-1']}
```
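To surface these events quickly, a simple search over both logs on the head node works; this is a sketch using the error code string from the examples above:

```
# List insufficient capacity events from both logs.
$ sudo grep -i 'InsufficientInstanceCapacity' \
    /var/log/parallelcluster/slurm_resume.log \
    /var/log/parallelcluster/clustermgtd
```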