Grids are a form of distributed computing that enables a user to leverage multiple instances to perform parallel computations. Customers such as Numerate, Scribd, and the University of Barcelona/University of Melbourne use Grid Computing with Spot Instances because this type of architecture can take advantage of the built-in elasticity and low prices of Spot Instances to get work done faster and at lower cost.
To get started, a user breaks the work down into discrete units called jobs and submits them to a “master node.” The jobs are queued, and a process called a “scheduler” distributes them to other instances in the grid, called “worker nodes.” After a worker node computes its result, it notifies the master node and takes the next job from the queue. If a job fails or its instance is interrupted, the scheduler automatically re-queues the job.
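The queue-and-re-queue behavior described above can be sketched in a few lines. This is a minimal single-process simulation, not the implementation of any particular grid scheduler: the names `run_grid`, `Interrupted`, and `max_attempts` are hypothetical, and a real scheduler would dispatch jobs to separate worker instances rather than call a function in a loop.

```python
from collections import deque


class Interrupted(Exception):
    """Hypothetical signal for a failed job or an interrupted worker instance."""


def run_grid(jobs, worker, max_attempts=3):
    """Process queued jobs one at a time, re-queuing any job whose worker fails."""
    queue = deque((job, 1) for job in jobs)  # each entry is (job, attempt number)
    results = {}
    while queue:
        job, attempt = queue.popleft()
        try:
            results[job] = worker(job)  # worker reports its result to the master
        except Interrupted:
            if attempt < max_attempts:
                queue.append((job, attempt + 1))  # put the job back on the queue
            else:
                results[job] = None  # stop retrying after max_attempts tries
    return results


# Example: job 2 is interrupted on its first attempt and succeeds on the retry.
failed_once = set()


def square(job):
    if job == 2 and job not in failed_once:
        failed_once.add(job)
        raise Interrupted()
    return job * job


print(run_grid([1, 2, 3], square))  # {1: 1, 3: 9, 2: 4}
```

Job 2 finishes last because a re-queued job goes to the back of the queue. The attempt limit is an assumption added here so that a job that always fails cannot loop forever.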
As you architect your application, it is important to choose an appropriate amount of work for each job. We recommend grouping work into jobs based on the time each job would take to process. Typically, you will want each job to take less than an hour, so that if you have to process it again after an interruption, it doesn’t cost you additional money (you don’t pay for the hour if we interrupt your instance).
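One way to apply the sizing guidance above is to group work items greedily by estimated run time. The sketch below assumes you already have a per-item time estimate in seconds; the function name `batch_into_jobs` and the `limit_seconds` parameter are hypothetical, and other grouping strategies would work just as well.

```python
def batch_into_jobs(durations, limit_seconds=3600):
    """Group (item, estimated_seconds) pairs into jobs that stay under the limit.

    An item whose own estimate meets or exceeds the limit still gets a job
    to itself, since it cannot be split here.
    """
    jobs, current, total = [], [], 0
    for item, seconds in durations:
        # Close the current job if adding this item would reach the limit.
        if current and total + seconds >= limit_seconds:
            jobs.append(current)
            current, total = [], 0
        current.append(item)
        total += seconds
    if current:
        jobs.append(current)
    return jobs


# Four items estimated at 25, 25, 25, and 10 minutes become two jobs.
estimates = [("a", 1500), ("b", 1500), ("c", 1500), ("d", 600)]
print(batch_into_jobs(estimates))  # [['a', 'b'], ['c', 'd']]
```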
Many customers use a Grid scheduler, such as Oracle Grid Engine or UniCloud, to set up a cluster. If you have long-running workloads, the best practice is to run the master node on On-Demand or Reserved Instances, and to run the worker nodes on Spot Instances or on a mixture of On-Demand, Reserved, and Spot Instances. Alternatively, if your workload takes less than an hour or you are running a test environment, you may want to run all of your instances on Spot. No matter the setup, we recommend that you create a script to automatically re-add instances that are interrupted. Some existing tools (StarCluster, for example) can help you manage this process.
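The core of such a re-add script is a reconciliation step: compare the number of workers you want with the number you have, and request the difference. The sketch below shows only that logic. The callables `count_running`, `count_pending`, and `request_workers` are hypothetical placeholders for whatever EC2 calls or cluster tooling you use; they are not part of any AWS or StarCluster API.

```python
def workers_to_add(target, running, pending):
    """Return how many replacement workers are needed to reach the target."""
    return max(0, target - running - pending)


def reconcile(target, count_running, count_pending, request_workers):
    """Request replacements for interrupted workers; return how many were requested.

    count_running, count_pending, and request_workers are caller-supplied
    callables (assumed here) that wrap your instance provisioning tooling.
    """
    missing = workers_to_add(target, count_running(), count_pending())
    if missing:
        request_workers(missing)
    return missing


# Example with stand-in callables: 10 workers wanted, 7 running, 1 already requested.
requested = []
print(reconcile(10, lambda: 7, lambda: 1, requested.append))  # 2
```

Running `reconcile` on a timer (from cron, for example) keeps the grid near its target size. Counting pending requests avoids asking twice for the same replacement.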
Chris Dagdigian, from AWS Solution Provider BioTeam, provides a quick overview of how to start a cluster from scratch in about 10 to 15 minutes on Amazon EC2 Spot Instances using StarCluster. StarCluster is an open source tool created by a lab at MIT that makes it easy to set up a new Oracle Grid Engine cluster. In this video, Chris walks through the process of installing, setting up, and running simple jobs on a cluster. Chris also uses Spot Instances, which can help you get work done faster and potentially save between 50 and 66 percent.