One AWS Account Hosting One EMR Cluster Per Student - Teaching Big Data Skills with Amazon EMR

One AWS Account Hosting One EMR Cluster Per Student

This approach is our recommended deployment model because it allows central IT administration of all course resources while still exposing students to the AWS environment, which gives them a better understanding of how the infrastructure is configured. This option can also serve as the foundation for a packaged offering that uses AWS Service Catalog to provide an end-to-end, self-service deployment option for EMR clusters. A Service Catalog approach to deploying course infrastructure can further simplify the control and administration of your EMR clusters by giving students a dashboard that offers only basic deploy and tear-down options.

Figure 1: One AWS account hosting one EMR cluster per student

In this deployment option, an institution's IT department provisions a single AWS account and uses AWS Identity and Access Management (IAM) to give each student access to a separate EMR cluster. Each student maps to a unique IAM user and SSH key pair, which together provide least-privilege access to that student's own EMR cluster and data sources. Billing for the EMR resources can be tied back to a central, university-owned payer account using Consolidated Billing for AWS Organizations, which simplifies deployment. Customized billing alerts can also be created to notify IT administrators when resource usage rises above a certain cost threshold.
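As a minimal sketch of the least-privilege mapping described above, the following Python builds an IAM policy document that limits each student to clusters tagged with their own IAM user name and to their own prefix in a shared data bucket. The tag key `Owner`, the bucket name `course-data-bucket`, the `students/` prefix, and the chosen list of actions are assumptions made for illustration; they are not taken from this whitepaper, and a real course would adjust them to its own setup.

```python
import json


def student_policy(tag_key="Owner", bucket="course-data-bucket"):
    """Build a per-student IAM policy document (illustrative sketch).

    tag_key and bucket are hypothetical names. The ${aws:username} policy
    variable is resolved by IAM at request time, so one policy can be
    attached to a group that contains every student.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # EMR actions are allowed only on clusters whose tag value
                # equals the calling IAM user's name.
                "Sid": "OwnClusterOnly",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListSteps",
                    "elasticmapreduce:AddJobFlowSteps",
                    "elasticmapreduce:TerminateJobFlows",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        f"elasticmapreduce:ResourceTag/{tag_key}": "${aws:username}"
                    }
                },
            },
            {
                # Object access is limited to the student's own prefix.
                "Sid": "OwnDataPrefixOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/students/${{aws:username}}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(student_policy(), indent=2))
```

This sketch relies on every cluster being tagged with its owner's IAM user name when it is created; a cluster without that tag would not be reachable by any student under this policy.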

In terms of administrative overhead, this deployment option can consume a large amount of resources unless automation is in place to tear down resources when they are not in use. Configuration can be standardized with AWS CloudFormation: a new stack is deployed for each student only when resources are needed.
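The per-student stack deployment described above could be sketched roughly as follows, using the boto3 CloudFormation client's `create_stack` call. The stack naming scheme, the `StudentId` template parameter, the `Owner` and `Course` tags, and the default course name are all hypothetical; the template itself is assumed to exist already and to define the student's EMR cluster.

```python
import re


def stack_request(student_id, template_url, course="bigdata101"):
    """Build create_stack arguments for one student's stack (sketch).

    The StudentId parameter and the Owner/Course tags are assumptions;
    they must match whatever the course's own template expects.
    """
    # Stack names may contain only letters, digits, and hyphens, must start
    # with a letter, and are limited to 128 characters.
    name = re.sub(r"[^A-Za-z0-9-]", "-", f"emr-{course}-{student_id}")[:128]
    return {
        "StackName": name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": "StudentId", "ParameterValue": student_id},
        ],
        "Tags": [
            {"Key": "Owner", "Value": student_id},
            {"Key": "Course", "Value": course},
        ],
    }


def deploy_for_student(cfn, student_id, template_url):
    """Create the stack with a CloudFormation client and return its ID."""
    return cfn.create_stack(**stack_request(student_id, template_url))["StackId"]
```

With AWS credentials configured, a call might look like `deploy_for_student(boto3.client("cloudformation"), "jane.doe", template_url)`. Keeping the request builder separate from the API call makes it possible to review or test the arguments without creating any resources.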

In terms of cost, each student runs a small EMR cluster for a short period for each assignment. Compute costs are billed per second, with a 60-second minimum, and only for the resources that are used. With this type of deployment, students are exposed directly to the AWS Management Console or API to provision their own clusters. It is therefore the responsibility of the course administrator and students to terminate clusters that are not in use: either students properly terminate resources when their workloads are completed, or administrators automate the de-provisioning of student EMR clusters. This can be automated with an AWS Lambda function or by establishing a Time to Live (TTL) value on each AWS CloudFormation template. See Scheduling automatic deletion of AWS CloudFormation stacks on the AWS Infrastructure & Automation Blog.
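One way the Lambda-based cleanup mentioned above might look is sketched below. It lists active EMR clusters with the boto3 `list_clusters` call and terminates any that were created more than a fixed time ago using `terminate_job_flows`. The four-hour TTL, the choice to measure age from cluster creation time, and the optional `emr` argument (added so the handler can be exercised with a stand-in client) are assumptions for illustration, not a method prescribed by this whitepaper.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical TTL; a real course would pick its own value.
CLUSTER_TTL = timedelta(hours=4)
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def list_active_clusters(emr):
    """Return all active cluster summaries, following pagination markers."""
    clusters, marker = [], None
    while True:
        kwargs = {"ClusterStates": ACTIVE_STATES}
        if marker:
            kwargs["Marker"] = marker
        page = emr.list_clusters(**kwargs)
        clusters.extend(page.get("Clusters", []))
        marker = page.get("Marker")
        if not marker:
            return clusters


def expired_cluster_ids(clusters, now, ttl=CLUSTER_TTL):
    """Pick the IDs of clusters created at least `ttl` before `now`."""
    return [
        c["Id"]
        for c in clusters
        if now - c["Status"]["Timeline"]["CreationDateTime"] >= ttl
    ]


def lambda_handler(event, context, emr=None):
    """Entry point for a scheduled Lambda invocation."""
    if emr is None:
        import boto3  # available in the AWS Lambda Python runtime
        emr = boto3.client("emr")
    now = datetime.now(timezone.utc)
    expired = expired_cluster_ids(list_active_clusters(emr), now)
    if expired:
        emr.terminate_job_flows(JobFlowIds=expired)
    return {"terminated": expired}
```

A function like this would typically be triggered on a schedule. Clusters with termination protection enabled would not be shut down by this call, so the sketch assumes student clusters are created without it.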