Amazon EMR simplifies running Hadoop and related big-data applications on AWS. You can use it to manage and analyze vast amounts of data. For example, a cluster can be configured to process petabytes of data.
Developing and deploying custom Hadoop applications used to require access to a lot of hardware. Amazon EMR makes it easy to spin up a set of EC2 instances as virtual servers to run your Hadoop cluster. You can run various server configurations, such as fully loaded production servers and temporary testing servers, without having to purchase or reconfigure hardware. You can configure and deploy always-on production clusters, and you can just as easily terminate unused testing clusters after your development and testing phase is complete.
Amazon EMR provides several types of clusters that you can launch to run custom Hadoop map-reduce code, depending on the type of program you're developing and the libraries you intend to use.
A Custom JAR cluster runs your custom map-reduce program written in Java. This cluster type provides low-level access to the MapReduce API. You are responsible for defining and implementing the map and reduce tasks in your Java application.
A Cascading cluster installs the Cascading Java library, which provides features such as splitting and joining data streams. Using the Cascading Java library can simplify application development. With a Cascading cluster you can still access the low-level MapReduce APIs, as you can with the Custom JAR cluster type.
A Streaming cluster runs a single Hadoop job based on map and reduce functions that you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, or C++.
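As an illustration of the map and reduce functions described above, the following is a minimal word-count sketch in Python. It assumes the usual Hadoop streaming convention that records are passed as lines of text with a tab between key and value, and that the reducer receives its input sorted by key. The names `mapper`, `reducer`, and `simulate` are hypothetical and chosen for this example only; `simulate` is a local stand-in for the sort that Hadoop performs between the two phases, not part of Amazon EMR.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; records must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Local stand-in for a job run: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


print(simulate(["Hello world", "hello EMR"]))
```

In an actual job, each function would more likely live in its own script that reads records from standard input and writes them to standard output, and you would upload those scripts to Amazon S3.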
You can also use Amazon EMR to analyze and process data without writing a line of map-reduce code. Several open-source applications run on top of Hadoop and make it possible to run map-reduce jobs and manipulate data using either a SQL-like syntax or a specialized language called Pig Latin. Amazon EMR is integrated with Apache Hive, which provides the SQL-like syntax, and Apache Pig, which uses Pig Latin.
Distributed storage is a way to store large amounts of data over a distributed network of computers with redundancy to protect against data loss. Amazon EMR is integrated with the Hadoop Distributed File System (HDFS) and Apache HBase.
You can use Amazon EMR to move large amounts of data in and out of databases and data stores. By distributing the work, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Apache HBase.