Slurm log rotations - Amazon SageMaker AI

SageMaker HyperPod automatically rotates Slurm daemon logs to limit disk usage and maintain system performance. Rotation archives older log data and deletes the oldest archives while keeping recent log entries, so that log files do not consume excessive disk space. Slurm log rotation is enabled by default when you create a cluster.

How log rotation works

When enabled, the log rotation configuration:

  • Monitors all Slurm log files with the .log extension in the /var/log/slurm/ directory on the controller, login, and compute nodes.

  • Rotates logs when they reach 50 MB in size.

  • Keeps up to two rotated files for each log and deletes older ones.

  • Sends the SIGUSR2 signal to the Slurm daemons (slurmctld, slurmd, and slurmdbd) after rotation, which prompts them to reopen their log files.
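The size check, the two-file retention, and the copy-then-truncate step can be illustrated with a short Python sketch. This is not how logrotate is implemented, and the function name rotate_if_needed and its parameters are hypothetical; it only mirrors the behavior described in the list above and omits the SIGUSR2 step.

```python
import os
import shutil

MAX_BYTES = 50 * 1024 * 1024  # 50 MB threshold described above
KEEP = 2                      # number of rotated files to keep


def rotate_if_needed(path, max_bytes=MAX_BYTES, keep=KEEP):
    """Rotate `path` in copy-and-truncate style; return True if rotated."""
    if not os.path.exists(path) or os.path.getsize(path) < max_bytes:
        return False

    # Drop the oldest rotation, then shift the rest: .1 -> .2, and so on.
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.rename(src, f"{path}.{i + 1}")

    # Copy the live log to .1, then truncate it in place so that a daemon
    # holding the file open keeps writing to the same file.
    shutil.copy2(path, f"{path}.1")
    with open(path, "r+b") as f:
        f.truncate(0)
    return True
```

Calling rotate_if_needed on the same log three times, with the log refilled past the threshold before each call, leaves only the .1 and .2 files; the first copy is deleted on the third call.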

List of log files rotated

Slurm logs are located in the /var/log/slurm/ directory. Log rotation is enabled for all files that match /var/log/slurm/*.log. When rotation occurs, rotated files have numerical suffixes (such as slurmd.log.1). The following list is not exhaustive but shows some of the critical log files that rotate automatically:

  • /var/log/slurm/slurmctld.log

  • /var/log/slurm/slurmd.log

  • /var/log/slurm/slurmdbd.log

  • /var/log/slurm/slurmrestd.log
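To see which logs are present on a node and which rotated copies exist for each, a small helper like the following could be used. It is a sketch only: the function name list_slurm_logs is hypothetical, and it assumes rotated files carry the numeric suffixes described above.

```python
import glob
import os
import re


def list_slurm_logs(log_dir="/var/log/slurm"):
    """Map each live *.log file name to its rotated siblings, in numeric order."""
    result = {}
    for log_path in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
        # Accept only purely numeric suffixes such as slurmd.log.1
        suffix = re.compile(re.escape(log_path) + r"\.(\d+)$")
        rotated = [
            p for p in glob.glob(glob.escape(log_path) + ".*")
            if suffix.match(p)
        ]
        rotated.sort(key=lambda p: int(suffix.match(p).group(1)))
        result[os.path.basename(log_path)] = [
            os.path.basename(p) for p in rotated
        ]
    return result


if __name__ == "__main__":
    for name, rotated in list_slurm_logs().items():
        print(name, rotated)
```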

Enable or disable log rotation

You can control the log rotation feature using the enable_slurm_log_rotation parameter in the config.py script of your cluster's lifecycle scripts, as shown in the following example:

class Config:
    # Set false if you want to disable log rotation of Slurm daemon logs
    enable_slurm_log_rotation = True  # Default value

To disable log rotation, set the parameter to False, as shown in the following example:

enable_slurm_log_rotation = False

Note

Lifecycle scripts run on all Slurm nodes (controller, login, and compute nodes) during cluster creation, and on new nodes when they are added to the cluster. To change the log rotation configuration on existing nodes after cluster creation, you must update it manually. The configuration is stored in /etc/logrotate.d/sagemaker-hyperpod-slurm. To disable log rotation on a node, delete this file or comment out its contents by adding # at the start of each line. We recommend keeping log rotation enabled to prevent log files from consuming excessive disk space.
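Commenting out the configuration by hand is straightforward, but on many nodes a script may be more practical. The following is a minimal sketch, assuming it runs with permission to write the file; the helper names comment_out and disable_rotation are hypothetical and are not part of the lifecycle scripts.

```python
def comment_out(text):
    """Prefix every non-blank line that is not already a comment with '# '."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.strip() and not line.lstrip().startswith("#"):
            line = "# " + line
        out.append(line)
    return "".join(out)


def disable_rotation(path="/etc/logrotate.d/sagemaker-hyperpod-slurm"):
    """Rewrite the config at `path` with all lines commented out."""
    with open(path) as f:
        original = f.read()
    with open(path, "w") as f:
        f.write(comment_out(original))
```

Because already-commented lines are left unchanged, running disable_rotation twice on the same file produces the same result as running it once.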

Default log rotation settings

The following settings are configured automatically for each log file rotated:

Setting        Value      Description
rotate         2          Number of rotated log files to keep
size           50 MB      Maximum size before rotation
copytruncate   enabled    Copies and truncates the original log file
compress       disabled   Rotated logs are not compressed
missingok      enabled    No error if log file is missing
notifempty     enabled    Doesn't rotate empty files
noolddir       enabled    Rotated files stay in same directory
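For orientation, the settings above correspond to standard logrotate directives. The sketch below renders them as a logrotate stanza. It is an approximation, not the actual contents of /etc/logrotate.d/sagemaker-hyperpod-slurm: the function name render_stanza is hypothetical, the size is written in logrotate's 50M notation, and the step that signals the Slurm daemons after rotation is omitted.

```python
def render_stanza(pattern="/var/log/slurm/*.log", rotate=2, size="50M"):
    """Build a logrotate stanza that reflects the default settings table."""
    directives = [
        f"rotate {rotate}",   # keep this many rotated files
        f"size {size}",       # rotate once the log grows past this size
        "copytruncate",       # copy the log, then truncate the original
        "nocompress",         # leave rotated logs uncompressed
        "missingok",          # no error if the log file is missing
        "notifempty",         # skip empty log files
        "noolddir",           # keep rotated files in the same directory
    ]
    body = "\n".join(f"    {d}" for d in directives)
    return f"{pattern} {{\n{body}\n}}\n"


if __name__ == "__main__":
    print(render_stanza(), end="")
```

Comparing this output with the file on a node is one way to check whether the defaults have been changed there.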