
Impala Memory Considerations

Impala's memory requirements depend on the type of query. There is no simple rule for determining the maximum data size a cluster can process from its aggregate memory. The compression type, the partitioning scheme, and the query itself (number of joins, result size, and so on) all affect how much memory is required. For example, a cluster with only 60 GB of memory can perform a single-table scan over tables of 128 GB and larger, whereas a join operation may quickly exhaust memory even when the combined size of the tables is smaller than the memory available. Therefore, to make full use of the available resources, it is important to optimize your queries. You can tune an Impala query for performance and to minimize resource consumption, and you can use the EXPLAIN statement to estimate the memory and other resources that a query needs before you run it. In addition, for the best experience with Impala, we recommend using memory-optimized instances for your cluster. For more information, see Impala Performance Testing and Query Optimization.
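
For example, running EXPLAIN before submitting a query returns the execution plan, including per-host memory estimates for each stage, without actually running the query. A minimal sketch follows; the customers and orders tables are hypothetical names used only for illustration:

   -- Return the execution plan, including estimated per-host memory,
   -- without executing the query. The join and aggregation stages
   -- in this plan are the ones most likely to drive memory use.
   EXPLAIN
   SELECT c.name, COUNT(o.order_id) AS order_count
   FROM customers c
   JOIN orders o ON c.customer_id = o.customer_id
   GROUP BY c.name;

If the estimated memory in the plan approaches what your nodes have available, consider rewriting the query or running it against a smaller partition first.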

You can run multiple queries at one time on an Impala cluster on Amazon EMR. However, because each query is executed in memory, make sure that your cluster has the resources to handle the number of concurrent queries you anticipate. In addition, you can set up a multi-tenant cluster with both Impala and MapReduce installed. On Hadoop 2.x, be sure to allocate resources (memory, disk, and CPU) to each application using YARN, as shown in the sketch below; the allocation should depend on the needs of the jobs you plan to run on each application. For more information, go to http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
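
As an illustration, the following yarn-site.xml snippet caps the memory and CPU that YARN can hand out to MapReduce containers on each node, leaving the remainder for Impala. The property names are standard Hadoop 2.x YARN settings, but the values shown are hypothetical and depend on your instance type and workload:

   <property>
     <!-- Total memory (MB) on each node that YARN may allocate to
          containers; memory above this amount remains for Impala. -->
     <name>yarn.nodemanager.resource.memory-mb</name>
     <value>16384</value>
   </property>
   <property>
     <!-- Largest single container YARN will allocate. -->
     <name>yarn.scheduler.maximum-allocation-mb</name>
     <value>4096</value>
   </property>
   <property>
     <!-- Number of virtual cores YARN may allocate on each node. -->
     <name>yarn.nodemanager.resource.cpu-vcores</name>
     <value>8</value>
   </property>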

If you run out of memory, queries fail and the Impala daemon on the affected node shuts down. Amazon EMR then restarts the daemon on that node so that Impala is ready to run another query. Your data in HDFS on the node remains available, because only the daemon shuts down, not the node itself. For ad hoc analysis with Impala, the query time can often be measured in seconds; if a query fails, you can discover the problem quickly and submit a new query right away.