Best practices - AWS ParallelCluster

Best practices: master instance type selection

Although the master node doesn't run any jobs, its functions and sizing are crucial to the overall performance of the cluster.

When choosing the instance type for your master node, evaluate the following items:

  • Cluster size: the master node orchestrates the scaling logic of the cluster and is responsible for attaching new nodes to the scheduler. If you need to scale the cluster up and down by a considerable number of nodes, give the master node some extra compute capacity.

  • Shared file systems: when you use shared file systems to share artifacts between the compute nodes and the master node, take into account that the master node is the one exposing the NFS server. For this reason, choose an instance type with enough network bandwidth and enough dedicated Amazon EBS bandwidth to handle your workflows.
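
As a minimal sketch of the sizing advice above, the master node's instance type is set with master_instance_type in the [cluster] section of the configuration file. The section name default and the instance type c5n.2xlarge are illustrative assumptions, not recommendations from this guide; choose a type whose compute capacity, network bandwidth, and Amazon EBS bandwidth fit your own workload.

```ini
[cluster default]
# Hypothetical choice for illustration only: a network-optimized instance
# type for a master node that scales many nodes and serves NFS shares.
master_instance_type = c5n.2xlarge
```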

Best practices: network performance

The following four recommendations can help you improve network communication.

  • Placement group: a cluster placement group is a logical grouping of instances within a single Availability Zone. For more information on placement groups, see placement groups in the Amazon EC2 User Guide for Linux Instances. You can configure the cluster to use your own placement group by setting placement_group = your-placement-group-name, or let AWS ParallelCluster create a placement group with the "compute" strategy by setting placement_group = DYNAMIC. For more information, see placement_group for multiple queue mode and placement_group for single queue mode.

  • Enhanced networking: consider choosing an instance type that supports Enhanced Networking. For more information, see enhanced networking on Linux in the Amazon EC2 User Guide for Linux Instances.

  • Elastic Fabric Adapter: to support high levels of scalable inter-instance communication, consider choosing EFA network interfaces for your network. The EFA's custom-built operating system (OS) bypass hardware enhances inter-instance communications with the on-demand elasticity and flexibility of the AWS cloud. To configure a single Slurm cluster queue to use EFA, set enable_efa = true. For more information about using EFA with AWS ParallelCluster, see Elastic Fabric Adapter and enable_efa. For more information about EFA, see Elastic Fabric Adapter in the Amazon EC2 User Guide for Linux Instances.

  • Instance bandwidth: bandwidth scales with instance size, so consider choosing the instance type that best suits your needs. For more information, see Amazon EBS–optimized instances and Amazon EBS volume types in the Amazon EC2 User Guide for Linux Instances.
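
The placement group and EFA settings described above can be sketched together for a single Slurm queue in multiple queue mode. This is an illustrative fragment under stated assumptions, not a complete or definitive configuration: the section names default, compute, and efa-nodes are hypothetical, and c5n.18xlarge is only one example of an instance type that supports EFA.

```ini
[cluster default]
scheduler = slurm
# "compute" is a hypothetical queue name.
queue_settings = compute

[queue compute]
# Let AWS ParallelCluster create and manage the placement group.
placement_group = DYNAMIC
# Use EFA network interfaces for the instances in this queue.
enable_efa = true
# "efa-nodes" is a hypothetical compute resource name.
compute_resource_settings = efa-nodes

[compute_resource efa-nodes]
# Example only: the instance type must be one that supports EFA.
instance_type = c5n.18xlarge
```

To use an existing placement group instead, replace DYNAMIC with the name of your placement group.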

Best practices: budget alerts

To manage AWS ParallelCluster resource costs, we recommend that you use AWS Budgets actions to create a budget and define budget threshold alerts for selected AWS resources. For more information, see Configuring a budget action in the AWS Budgets User Guide. You can also use Amazon CloudWatch to create a billing alarm. For more information, see Creating a billing alarm to monitor your estimated AWS charges.