Loosely Coupled Scenarios
A loosely coupled workload processes a large number of smaller jobs. Generally, each job runs on one node, using either a single process or multiple processes with shared memory parallelization (SMP) within that node.
The outputs of the parallel processes, or of the iterations in the simulation, are post-processed into a single solution or discovery. Loosely coupled applications are found in many areas, including Monte Carlo simulations, image processing, genomics analysis, and Electronic Design Automation (EDA).
The loss of one node or job in a loosely coupled workload usually doesn’t delay the entire calculation. The lost work can be picked up later or omitted altogether. The nodes involved in the calculation can vary in specification and power.
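As an illustration of the properties described above, the following is a minimal Python sketch, not a reference implementation. It assumes a Monte Carlo estimate of pi as the workload. The names `run_job`, `flaky_job`, and `estimate_pi` are hypothetical, and a local thread pool stands in for a fleet of independent nodes. Each job runs with its own seed and shares no state with the others, lost jobs are simply omitted, and the surviving results are post-processed into one estimate.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(job_id: int, samples: int) -> tuple[int, int]:
    """One independent job: count random points inside the unit quarter-circle."""
    rng = random.Random(job_id)  # per-job seed; no state is shared between jobs
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, samples


def flaky_job(job_id: int, samples: int, lost_jobs: frozenset) -> tuple[int, int]:
    """Hypothetical wrapper that simulates losing the node running selected jobs."""
    if job_id in lost_jobs:
        raise RuntimeError(f"job {job_id} lost")
    return run_job(job_id, samples)


def estimate_pi(num_jobs: int, samples: int,
                lost_jobs: frozenset = frozenset()) -> tuple[float, int]:
    """Run all jobs, omit the lost ones, and post-process the rest into one estimate."""
    total_hits = total_samples = completed = 0
    # The thread pool is only a local stand-in for many separate nodes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(flaky_job, j, samples, lost_jobs)
                   for j in range(num_jobs)]
        for future in as_completed(futures):
            try:
                hits, n = future.result()
            except RuntimeError:
                continue  # the lost work is omitted; other jobs are unaffected
            total_hits += hits
            total_samples += n
            completed += 1
    if total_samples == 0:
        raise ValueError("no job completed")
    return 4.0 * total_hits / total_samples, completed


if __name__ == "__main__":
    pi_estimate, completed = estimate_pi(8, 20_000, lost_jobs=frozenset({3}))
    print(f"{completed} of 8 jobs completed, estimate = {pi_estimate:.4f}")
```

Losing job 3 in this sketch only reduces the sample count behind the final estimate; no other job waits on it, which is the property that separates loosely coupled workloads from tightly coupled ones.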
A suitable architecture for a loosely coupled workload has the following considerations:
- Network: Because parallel processes do not typically interact with each other, neither the feasibility nor the performance of the workload is sensitive to the bandwidth and latency of the network between instances. Cluster placement groups are therefore unnecessary for this scenario: they reduce resiliency without providing a performance gain.
- Storage: Storage requirements for loosely coupled workloads vary and are driven by the dataset size and the desired performance for transferring, reading, and writing the data.
- Compute: Each application is different, but in general, the application’s memory-to-compute ratio drives the underlying EC2 instance type. Some applications are optimized to take advantage of graphics processing units (GPUs) or field-programmable gate array (FPGA) accelerators on EC2 instances.
- Deployment: Loosely coupled simulations often run across many compute cores, sometimes millions, which can be spread across Availability Zones without sacrificing performance. They can be deployed with end-to-end services and solutions such as AWS Batch and AWS ParallelCluster, or through a combination of AWS services, such as Amazon Simple Queue Service (Amazon SQS), Auto Scaling, AWS Lambda, and AWS Step Functions.
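The queue-based deployment style mentioned in the list above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not an AWS implementation: an in-process `queue.Queue` stands in for a managed queue such as Amazon SQS, threads stand in for worker instances, and the names `worker`, `run_queue`, and `max_attempts` are hypothetical. Workers pull jobs independently, and a failed job is put back on the queue so that it can be picked up later.

```python
import queue
import threading


def worker(jobs: queue.Queue, results: dict, lock: threading.Lock,
           process, max_attempts: int) -> None:
    """Pull jobs until the queue is empty; re-queue a failed job for a later attempt."""
    while True:
        try:
            job_id, attempt = jobs.get_nowait()
        except queue.Empty:
            return  # nothing left to do; this worker can be shut down
        try:
            value = process(job_id, attempt)
        except Exception:
            if attempt + 1 < max_attempts:
                jobs.put((job_id, attempt + 1))  # picked up later by any worker
            # otherwise the job is dropped after max_attempts tries
        else:
            with lock:
                results[job_id] = value


def run_queue(job_ids, process, num_workers: int = 4, max_attempts: int = 3) -> dict:
    """Fill the queue, start the workers, and collect results by job id."""
    jobs: queue.Queue = queue.Queue()
    for job_id in job_ids:
        jobs.put((job_id, 0))
    results: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(jobs, results, lock, process, max_attempts))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def square_flaky(job_id: int, attempt: int) -> int:
    """Hypothetical job that fails on its first attempt for odd job ids."""
    if job_id % 2 == 1 and attempt == 0:
        raise RuntimeError("simulated node loss")
    return job_id * job_id


if __name__ == "__main__":
    print(run_queue(range(10), square_flaky))
```

Because workers coordinate only through the queue, the number of workers can change during the run without affecting correctness, which is what makes this pattern a natural fit for scaling a fleet up and down.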