Slurm log rotations
SageMaker HyperPod provides automatic log rotation for Slurm daemon logs to help manage disk space usage and maintain system performance. Log rotation is crucial for preventing logs from consuming excessive disk space and ensuring optimal system operation by automatically archiving and removing old log files while maintaining recent logging information. Slurm log rotations are enabled by default when you create a cluster.
How log rotation works
When enabled, the log rotation configuration:
- Monitors all Slurm log files with the extension .log located in the /var/log/slurm/ folder on the controller, login, and compute nodes.
- Rotates logs when they reach 50 MB in size.
- Maintains up to two rotated log files before deleting them.
- Sends the SIGUSR2 signal to the Slurm daemons (slurmctld, slurmd, and slurmdbd) after rotation.
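The size-based policy described above can be sketched in a few lines of Python. This is only an illustration of the rotation order (drop the oldest copy, shift the remaining copies, then copy and truncate the live file); it is not the HyperPod implementation, which relies on logrotate. The function name rotate_if_needed and the demo helper are hypothetical, and the sketch omits the SIGUSR2 step.

```python
import os
import shutil
import tempfile

SIZE_LIMIT = 50 * 1024 * 1024  # 50 MB threshold described above
KEEP = 2                       # number of rotated files to keep


def rotate_if_needed(path, size_limit=SIZE_LIMIT, keep=KEEP):
    """Rotate `path` if it has reached size_limit; return True if rotated."""
    if not os.path.exists(path) or os.path.getsize(path) < size_limit:
        return False
    # Delete the oldest rotated copy, then shift the others up by one suffix.
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.rename(src, f"{path}.{i + 1}")
    # Copy the live file to the ".1" suffix, then truncate it in place so
    # that a daemon holding the file open keeps writing to the same file.
    shutil.copyfile(path, f"{path}.1")
    with open(path, "w"):
        pass
    return True


def demo():
    """Write to a scratch log three times with a tiny threshold."""
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "slurmd.log")
        for _ in range(3):
            with open(log, "a") as f:
                f.write("x" * 100)
            rotate_if_needed(log, size_limit=100)
        return sorted(os.listdir(tmp))
```

Calling demo() rotates a scratch file three times and leaves only the live log plus two rotated copies, which mirrors the "up to two rotated log files" behavior.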
List of log files rotated
Slurm logs are located in the /var/log/slurm/ directory. Log rotation is enabled for all files that match /var/log/slurm/*.log. When rotation occurs, rotated files have numerical suffixes (such as slurmd.log.1). The following list is not exhaustive but shows some of the critical log files that rotate automatically:
- /var/log/slurm/slurmctld.log
- /var/log/slurm/slurmd.log
- /var/log/slurm/slurmdb.log
- /var/log/slurm/slurmrestd.log
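To see which files on a node fall under the /var/log/slurm/*.log pattern and which rotated copies currently exist, a short script along these lines may help. It is a minimal sketch: the function name slurm_log_inventory is hypothetical, and it assumes rotated copies use a purely numeric suffix, as in slurmd.log.1.

```python
from pathlib import Path


def slurm_log_inventory(log_dir="/var/log/slurm"):
    """Map each active *.log file name to its rotated copies."""
    root = Path(log_dir)
    inventory = {}
    for log in sorted(root.glob("*.log")):
        # Rotated copies share the log's name plus a numeric suffix.
        rotated = sorted(
            p.name
            for p in root.glob(log.name + ".*")
            if p.name[len(log.name) + 1:].isdigit()
        )
        inventory[log.name] = rotated
    return inventory
```

Running slurm_log_inventory() on a node returns a dictionary keyed by active log file name. An empty list for a given log means that file has not been rotated yet, or that its rotated copies were removed.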
Enable or disable log rotation
You can control the log rotation feature using the enable_slurm_log_rotation parameter in the config.py script of your cluster's lifecycle scripts, as shown in the following example:
class Config:
    # Set to False to disable log rotation of Slurm daemon logs
    enable_slurm_log_rotation = True  # Default value
To disable log rotation, set the parameter to False, as shown in the following example:
enable_slurm_log_rotation = False
Note
Lifecycle scripts run on all Slurm nodes (controller, login, and compute nodes) during cluster creation. They also run on new nodes when they are added to the cluster. After cluster creation, you must update the log rotation configuration manually. The log rotation configuration is stored in /etc/logrotate.d/sagemaker-hyperpod-slurm.
We recommend keeping log rotation enabled to prevent log files from consuming excessive disk space. To disable log rotation on an existing cluster, delete the sagemaker-hyperpod-slurm file or comment out its contents by adding # at the start of each line in the file.
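As a rough illustration of the "comment out its contents" option, the following sketch prefixes each directive line with #. The helper names comment_out and disable_rotation are hypothetical. The sketch assumes it runs with sufficient privileges to write under /etc/logrotate.d/, and that you repeat it on every node where rotation should stop.

```python
from pathlib import Path

# Path taken from the note above.
CONFIG_PATH = Path("/etc/logrotate.d/sagemaker-hyperpod-slurm")


def comment_out(text):
    """Prefix each non-blank line that is not already a comment with '#'."""
    lines = []
    for line in text.splitlines(keepends=True):
        if line.strip() and not line.lstrip().startswith("#"):
            line = "#" + line
        lines.append(line)
    return "".join(lines)


def disable_rotation(path=CONFIG_PATH):
    """Rewrite the config file in place with every directive commented out."""
    path = Path(path)
    path.write_text(comment_out(path.read_text()))
```

Commenting out rather than deleting keeps the original settings in the file, so removing the leading # characters restores rotation later.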
Default log rotation settings
The following settings are configured automatically for each log file rotated:
Setting | Value | Description |
---|---|---|
rotate | 2 | Number of rotated log files to keep |
size | 50 MB | Maximum size before rotation |
copytruncate | enabled | Copies and truncates the original log file |
compress | disabled | Rotated logs are not compressed |
missingok | enabled | No error if log file is missing |
notifempty | enabled | Doesn't rotate empty files |
noolddir | enabled | Rotated files stay in same directory |
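For readers who want to relate the table to logrotate syntax, the sketch below renders these settings as a logrotate stanza. It is an approximation built only from the table: the actual contents of the sagemaker-hyperpod-slurm file are not shown in this document and may differ, for example by including a script that signals the Slurm daemons after rotation. The names DEFAULTS and render_stanza are hypothetical, and 50M is the logrotate spelling assumed here for 50 MB.

```python
# Settings from the table above. True/False stand for enabled/disabled.
DEFAULTS = {
    "rotate": 2,
    "size": "50M",
    "copytruncate": True,
    "compress": False,
    "missingok": True,
    "notifempty": True,
    "noolddir": True,
}


def render_stanza(pattern, settings):
    """Render a logrotate-style stanza for files matching `pattern`."""
    lines = [pattern + " {"]
    for key, value in settings.items():
        if value is True:
            lines.append(f"    {key}")           # flag directive, enabled
        elif value is False:
            lines.append(f"    no{key}")         # e.g. compress -> nocompress
        else:
            lines.append(f"    {key} {value}")   # directive with an argument
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_stanza("/var/log/slurm/*.log", DEFAULTS))
```

Comparing the rendered text with the file on a node is one way to check whether the defaults have been changed manually.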