Using Amazon Elastic MapReduce (Amazon EMR) to run Hadoop on Amazon Web Services offers many advantages.
When you run your Hadoop cluster on Amazon EMR, you can easily expand or shrink the number of virtual servers in your cluster depending on your processing needs. Adding or removing servers takes minutes, which is much faster than making similar changes in clusters running on physical servers.
By running your cluster on Amazon EMR, you pay only for the computational resources you use. You do not pay ongoing overhead costs for hardware maintenance and upgrades, and you do not have to pre-purchase extra capacity to meet peak needs. For example, if the amount of data you process in a daily cluster peaks on Monday, you can increase the number of servers in the cluster to 50 that day, and then scale back to 10 servers in the clusters that run on the other days of the week. You won't have to pay to maintain those additional 40 servers during the rest of the week, as you would with physical servers. For more information, see Amazon Elastic MapReduce Pricing.
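The arithmetic behind the Monday example can be sketched as follows. This is an illustration only: it counts server-days rather than dollars, it assumes each server runs for one billing unit per day, and the names `server_days`, `elastic`, and `fixed` are made up for this sketch.

```python
def server_days(schedule):
    """Total server-days for a week, given the server count for each day."""
    return sum(schedule.values())


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Elastic schedule from the example: 50 servers on Monday, 10 on other days.
elastic = {day: (50 if day == "Mon" else 10) for day in days}

# Fixed capacity sized for the Monday peak, as with physical servers.
fixed = {day: 50 for day in days}

elastic_total = server_days(elastic)
fixed_total = server_days(fixed)

# Fraction of server-days avoided by scaling down after the peak.
savings = 1 - elastic_total / fixed_total

print(f"elastic: {elastic_total} server-days, fixed: {fixed_total} server-days")
print(f"server-days avoided: {savings:.0%}")
```

Under these assumptions, the elastic schedule uses 110 server-days against 350 for capacity sized to the peak. Actual cost differences depend on current pricing and on how long each cluster runs.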
When you launch a cluster on Amazon EMR, the web service allocates the virtual server instances and configures them with the needed software for you. Within minutes you can have a cluster configured and ready to run your Hadoop application.
The version of Hadoop installed on Amazon EMR clusters is integrated with Amazon S3, which means that you can store your input and output data in Amazon S3, on the cluster in HDFS, or a mix of both. Amazon S3 can be accessed like a file system from applications running on your Amazon EMR cluster.
If your input data is stored in Amazon S3, multiple clusters can access the same data simultaneously.
Spot Instances are a way to purchase virtual servers for your cluster at a discount. Excess capacity in Amazon Web Services is offered at a fluctuating price, based on supply and demand. You set a maximum bid price that you are willing to pay for a certain configuration of virtual server. While the Spot Price for that type of server is below your bid price, the servers are added to your cluster and you are billed at the Spot Price rate. When the Spot Price rises above your bid price, the servers are terminated.
For more information about how to use Spot Instances effectively in your cluster, see (Optional) Lower Costs with Spot Instances.
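The bid mechanism described above can be sketched as a small simulation. This is a simplified model, not how Amazon EMR is implemented: the prices and bid are hypothetical, each price is assumed to hold for one billing period, and treating a Spot Price equal to the bid as "running" is an assumption the text does not settle.

```python
def simulate_spot(bid, spot_prices):
    """Return (price, running, billed_rate) for each billing period.

    Servers run while the Spot Price does not exceed the bid, and are
    billed at the Spot Price, not at the bid. When the Spot Price rises
    above the bid, the servers are terminated and nothing is billed.
    """
    timeline = []
    for price in spot_prices:
        running = price <= bid  # assumption: equality counts as running
        billed = price if running else 0.0
        timeline.append((price, running, billed))
    return timeline


# Hypothetical hourly Spot Prices and bid, in dollars per server-hour.
prices = [0.10, 0.12, 0.20, 0.11]
bid = 0.15

timeline = simulate_spot(bid, prices)
total_billed = sum(billed for _, _, billed in timeline)

for price, running, billed in timeline:
    state = "running" if running else "terminated"
    print(f"spot={price:.2f} bid={bid:.2f} -> {state}, billed {billed:.2f}")
print(f"total billed per server: {total_billed:.2f}")
```

In this run the servers are terminated only in the third period, when the hypothetical Spot Price of 0.20 exceeds the 0.15 bid, and the billed rate in the other periods is always the Spot Price rather than the bid.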
Amazon EMR is integrated with other Amazon Web Services such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline. This means that you can easily access data stored in AWS from your cluster and you can make use of the functionality offered by other Amazon Web Services to manage your cluster and store the output of your cluster.
For example, you can use Amazon EMR to analyze data stored in Amazon S3 and output the results to Amazon RDS or DynamoDB. You can monitor the performance of your cluster using CloudWatch, and you can automate recurring clusters with AWS Data Pipeline. As new services are added, you'll be able to make use of those new technologies as well. For more information, see Monitor Metrics with CloudWatch and Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR.
When you launch a cluster on Amazon EMR, you specify the size and capabilities of the virtual servers used in the cluster. This way you can match the virtualized servers to the processing needs of the cluster. You can choose virtual server instances to improve cost, speed up performance, or store large amounts of data.
For example, you might launch one cluster on high-storage virtual servers to host a data warehouse, and launch a second cluster on high-memory virtual servers to improve performance. Because you are not locked into a given hardware configuration as you are with physical servers, you can adjust each cluster to your requirements. For more information about the server configurations available using Amazon EMR, see Choose the Number and Type of Virtual Servers.
Amazon EMR supports several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.
When you launch a cluster using Amazon EMR, you have root access to the cluster and can install software and configure the cluster before Hadoop starts. For more information, see (Optional) Create Bootstrap Actions to Install Additional Software.
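As one illustration of how a bootstrap action might be described when launching a cluster programmatically, the sketch below builds the list that the AWS SDK for Python (boto3) accepts as the `BootstrapActions` parameter of `run_job_flow`. It only constructs the data structure and does not call AWS. The helper `bootstrap_action`, the bucket name, the script path, and the argument are all hypothetical.

```python
import json


def bootstrap_action(name, script_path, args=()):
    """Build one entry for the BootstrapActions list of a launch request.

    The script at script_path runs on the cluster's servers before
    Hadoop starts.
    """
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,
            "Args": list(args),
        },
    }


# Hypothetical script location and arguments, for illustration only.
actions = [
    bootstrap_action(
        "install-extra-tools",
        "s3://example-bucket/bootstrap/install-tools.sh",
        ["--verbose"],
    )
]

print(json.dumps(actions, indent=2))
```

A list like `actions` would be supplied alongside the other launch settings (cluster name, instance configuration, and so on) when the cluster is created.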
You can manage your clusters using the Amazon EMR console (a web-based user interface), a command line interface, web service APIs, and a variety of SDKs. For more information, see What Tools are Available for Amazon EMR?.
You can run Amazon EMR in an Amazon VPC in which you configure networking and security rules. Amazon EMR also supports IAM users and roles, which you can use to control access to your cluster and to set permissions that restrict what others can do on the cluster. For more information, see Configure Access to the Cluster.