The base configuration lifecycle scripts that the SageMaker HyperPod team provides
also include installation of various metric exporter packages. To activate the
installation step, set the parameter enable_observability=True in the config.py file.
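As a minimal sketch, the setting might look like the following in config.py. The lifecycle_script.py excerpt reads the flag as Config.enable_observability, so this assumes a Config class that holds it as a class attribute. Every other setting in the file is omitted here, and the layout of the provided file may differ.

```python
# config.py (sketch): only the observability flag is shown.
# Assumption: the flag lives on a Config class, as implied by the
# Config.enable_observability reference in lifecycle_script.py.
class Config:
    # Set to True so that the lifecycle scripts install the metric
    # exporter packages and Prometheus.
    enable_observability = True
```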
Name | Script deployment target node | Exporter description |
Slurm exporter for Prometheus | Head (controller) node | Exports Slurm Accounting metrics. |
Elastic Fabric Adapter (EFA) node exporter | Compute node | Exports metrics from cluster nodes and EFA. The package is a fork of the Prometheus node exporter. |
NVIDIA Data Center GPU Management (DCGM) exporter | Compute node | Exports NVIDIA DCGM metrics about health and performance of NVIDIA GPUs. |
With enable_observability=True in the config.py file, the following installation steps
in lifecycle_script.py are activated:
# Install metric exporting software and Prometheus for observability
if Config.enable_observability:
    if node_type == SlurmNodeType.COMPUTE_NODE:
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
        ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

    if node_type == SlurmNodeType.HEAD_NODE:
        wait_for_scontrol()
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
        ExecuteBashScript("./utils/install_prometheus.sh").run()
On the compute nodes, the script installs the NVIDIA Data Center GPU Management (DCGM) exporter and the Elastic Fabric Adapter (EFA) node exporter. The DCGM exporter is an exporter for Prometheus that collects metrics from NVIDIA GPUs, enabling monitoring of GPU usage, performance, and health. The EFA node exporter, on the other hand, gathers metrics related to the EFA network interface, which is essential for low-latency and high-bandwidth communication in HPC clusters.
On the head node, the script installs the Slurm exporter for Prometheus and the Prometheus open-source software.
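The per-node-type behavior described above can be summarized as a lookup table. The following is a rough, self-contained sketch for reasoning about which scripts run where, not the code the lifecycle scripts use. The SlurmNodeType enum here is a hypothetical stand-in (including its OTHER member and its values), observability_scripts is an illustrative helper, and the wait_for_scontrol() call that precedes the head node installs is left out. Only the script paths are taken from the lifecycle_script.py excerpt.

```python
from enum import Enum


class SlurmNodeType(Enum):
    # Hypothetical stand-in for the enum referenced in lifecycle_script.py;
    # the member values and the OTHER member are assumptions.
    HEAD_NODE = "head"
    COMPUTE_NODE = "compute"
    OTHER = "other"


# Script paths copied from the lifecycle_script.py excerpt, in run order.
INSTALL_PLAN = {
    SlurmNodeType.COMPUTE_NODE: [
        "./utils/install_docker.sh",
        "./utils/install_dcgm_exporter.sh",
        "./utils/install_efa_node_exporter.sh",
    ],
    SlurmNodeType.HEAD_NODE: [
        "./utils/install_docker.sh",
        "./utils/install_slurm_exporter.sh",
        "./utils/install_prometheus.sh",
    ],
}


def observability_scripts(node_type, enable_observability):
    """Return the install scripts that would run, in order, for a node type."""
    if not enable_observability:
        # With the flag off, none of the exporter installs are triggered.
        return []
    # Node types without an entry get an empty plan.
    return list(INSTALL_PLAN.get(node_type, []))
```

For example, observability_scripts(SlurmNodeType.HEAD_NODE, True) lists the Docker, Slurm exporter, and Prometheus install scripts in that order, while passing False returns an empty list for every node type.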
Note that the lifecycle scripts are designed to install all the exporter packages as
Docker containers, so Docker must also be installed on both the head and
compute nodes. The scripts for these components are provided in the utils directory.
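Because the exporters run as Docker containers, one way to spot-check a node after provisioning is to list its running containers. The sketch below is an illustration and not part of the lifecycle scripts: the helper names are hypothetical, and it assumes the docker CLI is on the PATH and that the current user is allowed to query the Docker daemon. It makes no assumption about how the exporter containers are named.

```python
import shutil
import subprocess


def parse_container_names(docker_ps_output):
    """Split the output of `docker ps --format '{{.Names}}'` into a list of names."""
    return [line.strip() for line in docker_ps_output.splitlines() if line.strip()]


def running_containers():
    """Return the names of running containers, or None if Docker cannot be queried."""
    if shutil.which("docker") is None:
        # Docker is not installed (or not on the PATH) on this node.
        return None
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # For example, the daemon is not running or permission was denied.
        return None
    return parse_container_names(result.stdout)
```

Calling running_containers() on a compute node or the head node returns the container names to compare against the exporters expected on that node type; a None result suggests that the Docker installation step did not complete.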
After you have successfully set up your HyperPod cluster with the exporter packages installed, proceed to the next topic to finish setting up Amazon Managed Service for Prometheus and Amazon Managed Grafana.