Install metrics exporter packages on your HyperPod cluster


The base configuration lifecycle scripts that the SageMaker HyperPod team provides also include installation of various metric exporter packages. To activate the installation step, set the parameter enable_observability=True in the config.py file. The lifecycle scripts are designed to bootstrap your cluster with the following open-source metric exporter packages.

Name	Script deployment target node	Exporter description
Slurm exporter for Prometheus	Head (controller) node	Exports Slurm Accounting metrics.
Elastic Fabric Adapter (EFA) node exporter	Compute node	Exports metrics from cluster nodes and EFA. The package is a fork of the Prometheus node exporter.
NVIDIA Data Center GPU Management (DCGM) exporter	Compute node	Exports NVIDIA DCGM metrics about health and performance of NVIDIA GPUs.
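As a minimal sketch of the setting itself: lifecycle_script.py reads the flag as Config.enable_observability, so the example below assumes config.py defines a Config class with that attribute. Any other settings in the real file are omitted here, and the exact layout of your config.py may differ.

```python
# config.py (sketch): only the observability flag is shown.
# Assumption: the flag is a class attribute, because the lifecycle
# script accesses it as Config.enable_observability.
class Config:
    # Set to True to activate the exporter installation step.
    enable_observability = True
```

With the flag left at False, the installation step is skipped and no exporter packages are installed.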

With enable_observability=True in the config.py file, the following installation step is activated in the lifecycle_script.py script.


# Install metric exporting software and Prometheus for observability
if Config.enable_observability:
    if node_type == SlurmNodeType.COMPUTE_NODE:
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
        ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

    if node_type == SlurmNodeType.HEAD_NODE:
        wait_for_scontrol()
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
        ExecuteBashScript("./utils/install_prometheus.sh").run()

On the compute nodes, the script installs the NVIDIA Data Center GPU Management (DCGM) exporter and the Elastic Fabric Adapter (EFA) node exporter. The DCGM exporter is an exporter for Prometheus that collects metrics from NVIDIA GPUs, enabling monitoring of GPU usage, performance, and health. The EFA node exporter, on the other hand, gathers metrics related to the EFA network interface, which is essential for low-latency and high-bandwidth communication in HPC clusters.

On the head node, the script installs the Slurm exporter for Prometheus and the Prometheus open-source software. The Slurm exporter provides Prometheus with metrics related to Slurm jobs, partitions, and node states.
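The per-node behavior described above can be previewed without running anything. The following is a dry-run sketch, not the lifecycle script itself: it reuses the script paths and the SlurmNodeType member names from the excerpt, while the enum values, the OBSERVABILITY_SCRIPTS table, and the planned_scripts helper are hypothetical names introduced for illustration. It also leaves out the wait_for_scontrol() call that the real script makes on the head node.

```python
from enum import Enum


class SlurmNodeType(Enum):
    # Member names mirror the excerpt; the string values are placeholders.
    HEAD_NODE = "head"
    COMPUTE_NODE = "compute"


# Install scripts per node type, in the order the excerpt runs them.
OBSERVABILITY_SCRIPTS = {
    SlurmNodeType.COMPUTE_NODE: [
        "./utils/install_docker.sh",
        "./utils/install_dcgm_exporter.sh",
        "./utils/install_efa_node_exporter.sh",
    ],
    SlurmNodeType.HEAD_NODE: [
        "./utils/install_docker.sh",
        "./utils/install_slurm_exporter.sh",
        "./utils/install_prometheus.sh",
    ],
}


def planned_scripts(node_type, enable_observability):
    """Return the scripts that would run for a node; nothing is executed."""
    if not enable_observability:
        return []
    return list(OBSERVABILITY_SCRIPTS.get(node_type, []))


# Print the plan for each node type with observability enabled.
for node_type in SlurmNodeType:
    print(node_type.name, planned_scripts(node_type, True))
```

Docker appears first in both lists because the exporters are installed as containers.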

Note that the lifecycle scripts install all the exporter packages as Docker containers, so Docker must also be installed on both the head and compute nodes. The installation scripts for these components are provided in the utils folder of the Awsome Distributed Training GitHub repository.
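Because the exporters run as containers, one way to check a node after provisioning is to list the running containers and look for the exporters you expect. The sketch below assumes the docker CLI is on the PATH and uses its `docker ps --format "{{.Names}}"` output. The container names depend on the install scripts, so the name fragments shown in the usage comment are hypothetical and should be replaced with what your nodes actually report.

```python
import subprocess


def parse_names(docker_ps_output):
    """Split `docker ps` name output into a list, dropping blank lines."""
    return [line.strip() for line in docker_ps_output.splitlines() if line.strip()]


def running_container_names():
    """Return the names of running containers on this node."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_names(result.stdout)


def missing_exporters(names, expected_fragments):
    """Return the expected fragments that match no running container name."""
    return [
        fragment
        for fragment in expected_fragments
        if not any(fragment in name for name in names)
    ]


# Usage on a compute node (fragments are hypothetical, adjust as needed):
#   missing = missing_exporters(running_container_names(), ["dcgm", "node"])
#   if missing:
#       print("No running container matches:", missing)
```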

After you have set up your HyperPod cluster with the exporter packages installed, proceed to the next topic to finish setting up Amazon Managed Service for Prometheus and Amazon Managed Grafana.
