Amazon EMR
Developer Guide

Impala Performance Testing and Query Optimization

When using Impala, it is important to understand how your cluster's memory resources limit the query types and dataset sizes you can process. Inspired by TPC-DS and Berkeley's Big Data Benchmark, we implemented a workload generator that produces table files of specified sizes in text format. We designed a spectrum of relational queries to test Impala's performance on whole-table scans, aggregations, and joins across different numbers of tables. We executed these queries against different input classes on clusters of different instance types, and compared the performance data against Hive's to assess Impala's strengths and limitations. The methods used in these tests are also the basis for the Launching and Querying Impala Clusters on Amazon EMR tutorial. For more information, see Tutorial: Launching and Querying Impala Clusters on Amazon EMR.

Database Schema

The input data set consists of three tables, created with the following statements in the Impala SQL dialect.

CREATE EXTERNAL TABLE books(
  id BIGINT,
  isbn STRING,
  category STRING,
  publish_date TIMESTAMP,
  publisher STRING,
  price FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/books/';

CREATE EXTERNAL TABLE customers(
  id BIGINT,
  name STRING,
  date_of_birth TIMESTAMP,
  gender STRING,
  state STRING,
  email STRING,
  phone STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/customers/';

CREATE EXTERNAL TABLE transactions(
  id BIGINT,
  customer_id BIGINT,
  book_id BIGINT,
  quantity INT,
  transaction_date TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/transactions/';

Sample Data

All tables are populated with randomly generated data that resembles real-world values. You can generate data using the same method outlined in the Generate Sample Data section of the tutorial. For more information, see Generate Test Data.

$ head books/books
0|1-45812-668-3|EDUCATION|1986-06-14|Shinchosha|50.99
1|9-69091-140-1|BODY-MIND-SPIRIT|1983-07-29|Lefebvre-Sarrut|91.99
2|3-73425-809-9|TRANSPORTATION|1996-07-08|Mondadori|54.99
3|8-23483-356-2|FAMILY-RELATIONSHIPS|2002-08-20|Lefebvre-Sarrut|172.99
4|3-58984-308-3|POETRY|1974-06-13|EKSMO|155.99
5|2-34120-729-8|TRAVEL|2004-06-30|Cengage|190.99
6|0-38870-277-1|TRAVEL|2013-05-26|Education and Media Group |73.99
7|8-74275-772-8|LAW|2012-05-01|Holtzbrinck|112.99
8|4-41109-927-4|LITERARY-CRITICISM|1986-04-06|OLMA Media Group|82.99
9|8-45276-479-4|TRAVEL|1998-07-04|Lefebvre-Sarrut|180.99

$ head customers/customers
0|Bailey RUIZ|1947-12-19|M|CT|bailey.ruiz.1947@hotmail.com|114-925-4866
1|Taylor BUTLER|1938-07-30|M|IL|taylor.butler.1938@yahoo.com|517-158-1597
2|Henry BROOKS|1956-12-27|M|IN|henry.brooks.1956@yahoo.com|221-653-3887
3|Kaitlyn WRIGHT|1988-11-20|F|NE|kaitlyn.wright.1988@hotmail.com|645-726-8901
4|Miles LOPEZ|1956-03-15|F|ND|miles.lopez.1956@hotmail.com|222-770-7004
5|Mackenzie PETERSON|1970-09-05|F|NJ|mackenzie.peterson.1970@outlook.com|114-521-5716
6|Maria SPENCER|2002-12-20|F|TX|maria.spencer.2002@yahoo.com|377-612-4105
7|Sienna HENDERSON|1982-11-04|F|MO|sienna.henderson.1982@gmail.com|199-288-5062
8|Samantha WALLACE|1998-03-06|F|TN|samantha.wallace.1998@hotmail.com|711-348-7410
9|Nevaeh PETERSON|1991-06-26|F|AL|nevaeh.peterson.1991@live.com|651-686-3436

$ head transactions/transactions
0|360677155|84060207|4|2010-03-24 10:24:22
1|228662770|136084430|5|2009-07-03 14:53:09
2|355529188|26348618|9|2009-09-13 11:53:26
3|1729168|20837134|5|2006-01-05 19:31:19
4|196166644|99142444|19|2007-01-02 15:07:38
5|43026573|479157832|17|2010-04-14 16:42:29
6|306402023|356688712|12|2010-05-24 22:15:54
7|359871959|312932516|31|2000-04-03 11:06:38
8|379787207|265709742|45|2013-09-09 06:01:06
9|144155611|137684093|11|2010-06-06 17:07:07
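As a rough illustration of this data's shape, rows like the books sample above could be produced by a short script such as the following. This is a minimal sketch with made-up value pools, not the tutorial's actual generator:

```python
import random

# Hypothetical stand-ins for the value pools a real generator would use.
CATEGORIES = ["EDUCATION", "TRAVEL", "LAW", "POETRY", "TRANSPORTATION"]
PUBLISHERS = ["Shinchosha", "Mondadori", "Cengage", "Holtzbrinck"]

def random_isbn(rng):
    """Build an ISBN-like string such as '1-45812-668-3'."""
    return "%d-%05d-%03d-%d" % (
        rng.randrange(10), rng.randrange(100000),
        rng.randrange(1000), rng.randrange(10))

def book_row(book_id, rng):
    """One pipe-delimited row matching the books table schema."""
    publish_date = "%04d-%02d-%02d" % (
        rng.randrange(1970, 2014), rng.randrange(1, 13), rng.randrange(1, 29))
    price = "%.2f" % (rng.randrange(50, 20000) / 100.0)
    return "|".join([str(book_id), random_isbn(rng),
                     rng.choice(CATEGORIES), publish_date,
                     rng.choice(PUBLISHERS), price])

if __name__ == "__main__":
    rng = random.Random(42)
    for i in range(10):
        print(book_row(i, rng))
```

A full-scale generator would stream billions of such rows to files under /data/books/, /data/customers/, and /data/transactions/ rather than printing them.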

Table Size

The following table shows the row count for each table (in millions of rows). The GB value of each input class indicates the size of each table's text file. Within an input class, the books, customers, and transactions tables always have the same file size.

Input Class (size of each table)   Books (million rows)   Customers (million rows)   Transactions (million rows)
4 GB                                63                     53                         87
8 GB                               125                    106                        171
16 GB                              249                    210                        334
32 GB                              497                    419                        659
64 GB                              991                    837                       1304
128 GB                            1967                   1664                       2538
256 GB                            3919                   3316                       5000
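As a sanity check, the implied average row width for each table follows from dividing the input-class size by the row count. The short script below assumes the GB figures are raw text-file sizes, which matches the sample rows shown earlier:

```python
# Implied average row width for each table in the 4 GB input class:
# file size in bytes divided by row count. Assumes "4GB" means the raw
# text-file size of each table (4 * 1024**3 bytes).
size_bytes = 4 * 1024**3
rows = {"books": 63_000_000,
        "customers": 53_000_000,
        "transactions": 87_000_000}

for table, n in rows.items():
    print(f"{table}: ~{size_bytes / n:.0f} bytes/row")
```

The results (roughly 68, 81, and 49 bytes per row) are consistent with the widths of the sample rows, which suggests the row counts and file sizes in the table agree with each other.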

Queries

We used four different query types in our performance testing:

Q1: Scan Query

SELECT COUNT(*) FROM customers WHERE name = 'Harrison SMITH';

This query performs a table scan through the entire table. With this query, we mainly test:

  • Impala's read throughput compared to that of Hive.

  • Whether, for a given aggregated memory size, there is a limit on the input size when performing a table scan, and if so, what the maximum input size is that Impala can handle.

Q2: Aggregation Query

SELECT category, count(*) cnt
FROM books
GROUP BY category
ORDER BY cnt DESC
LIMIT 10;

The aggregation query scans a single table, groups the rows, and calculates the size of each group.
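The shape of this aggregation can be reproduced on a toy dataset with Python's built-in sqlite3 module. SQLite stands in for Impala here purely to illustrate the query's semantics, not its execution model:

```python
import sqlite3

# Tiny in-memory stand-in for the books table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (id INTEGER, category TEXT, price REAL)")
sample = [(0, "TRAVEL", 50.0), (1, "TRAVEL", 60.0), (2, "LAW", 70.0),
          (3, "POETRY", 80.0), (4, "LAW", 90.0), (5, "TRAVEL", 40.0)]
con.executemany("INSERT INTO books VALUES (?, ?, ?)", sample)

# Same GROUP BY / ORDER BY / LIMIT structure as Q2.
top = con.execute(
    "SELECT category, COUNT(*) AS cnt FROM books "
    "GROUP BY category ORDER BY cnt DESC LIMIT 10").fetchall()
print(top)  # [('TRAVEL', 3), ('LAW', 2), ('POETRY', 1)]
```

In Impala, the memory cost of this query is dominated by the per-group aggregation state, which stays small because the number of distinct categories is small.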

Q3: Join Query between Two Tables

SELECT
  tmp.book_category,
  ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT
    books.category AS book_category,
    SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN [SHUFFLE] transactions ON (
    transactions.book_id = books.id
    AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC
LIMIT 10;

This query joins the books table with the transactions table to find the top 10 book categories with the highest total revenue during a certain period of time. In this experiment, we test Impala's performance on a join operation and compare the results with Hive's.
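Q3's logic can be reproduced on toy data with Python's built-in sqlite3 module. The Impala-specific [SHUFFLE] hint has no SQLite equivalent, and SQLite lacks a YEAR function, so strftime is used for the date filter; the example is purely illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (id INTEGER, category TEXT, price REAL)")
con.execute("CREATE TABLE transactions "
            "(book_id INTEGER, quantity INTEGER, transaction_date TEXT)")
con.executemany("INSERT INTO books VALUES (?, ?, ?)",
                [(1, "TRAVEL", 10.0), (2, "LAW", 20.0)])
# The 2005 transaction falls outside the 2008-2010 window and is ignored.
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [(1, 3, "2009-05-01"), (2, 1, "2009-06-01"),
                 (1, 2, "2005-01-01")])

result = con.execute("""
    SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
    FROM (SELECT books.category AS book_category,
                 SUM(books.price * transactions.quantity) AS revenue
          FROM books JOIN transactions
            ON transactions.book_id = books.id
           AND CAST(strftime('%Y', transactions.transaction_date) AS INTEGER)
               BETWEEN 2008 AND 2010
          GROUP BY books.category) tmp
    ORDER BY revenue DESC LIMIT 10""").fetchall()
print(result)  # [('TRAVEL', 30.0), ('LAW', 20.0)]
```

Only the qualifying 2008-2010 transactions contribute to revenue: TRAVEL earns 10.0 x 3 = 30.0 and LAW earns 20.0 x 1 = 20.0.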

Q4: Join Query between Three Tables

SELECT
  tmp.book_category,
  ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT
    books.category AS book_category,
    SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN [SHUFFLE] transactions ON (
    transactions.book_id = books.id
  )
  JOIN [SHUFFLE] customers ON (
    transactions.customer_id = customers.id
    AND customers.state IN ('WA', 'CA', 'NY')
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC
LIMIT 10;

This fourth query joins three tables and is the most memory-intensive query in our performance testing.

Performance Test Results

The first set of experimental results was obtained on a cluster of four m1.xlarge instances with Amazon EMR Hadoop 2.2.0 and Impala 1.1.1 installed. According to the hardware specification shown below, 60 GB of aggregated memory is available on the cluster.

Instance Type   Processor Architecture   vCPUs   ECU   Memory (GiB)   Instance Storage (GB)
m1.xlarge       64-bit                   4       8     15             4 x 420

We compared the query performance of Impala and Hive in terms of query execution time and show the results in the charts below. In these charts, the y axis shows the average execution time from four trials, measured using the time command. Missing data indicates that Impala failed due to an out-of-memory issue; we did not test Hive against these failed Impala queries.

From these figures, we observed that at smaller scales (in this experiment, 16 GB and lower), Impala is much faster than Hive due to the absence of the MapReduce framework overhead. Nonetheless, when the input data set is large enough that the framework overhead is negligible compared to the overall query time, Impala is only about 3 to 10 times faster.

The second experimental environment was a cluster of four m2.4xlarge instances using an AMI with Hadoop 2.2.0 and Impala 1.1.1. The aggregated memory on this cluster is 274 GB. Detailed hardware specifications and experimental results are shown below. As in our first set of tests, missing data indicates Impala failures caused by out-of-memory issues, and we did not run Hive against these queries.

Instance Type   Processor Architecture   vCPUs   ECU   Memory (GiB)   Instance Storage (GB)
m2.4xlarge      64-bit                   8       26    68.4           2 x 840

Optimizing Queries

Impala's memory requirement is determined by the query type. There is no simple, general rule relating the maximum data size a cluster can process to its aggregated memory size.

Impala does not load entire tables into memory, so the amount of available memory doesn't limit the table size that it can handle. Impala builds hash tables in memory, such as the right-hand side table of a join or the result set of an aggregation. In addition, Impala uses memory as I/O buffers, where the number of processor cores on the cluster and the speed of the scanners determine the amount of buffering that is necessary in order to keep all cores busy. For example, a simple SELECT count(*) FROM table statement only uses I/O buffer memory.
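To make the hash-table point concrete, the build-side footprint of a join can be roughly estimated by multiplying the row count by the per-entry size of only the cached columns. All numbers below are illustrative assumptions, not measurements from our experiment:

```python
# Rough, illustrative estimate of the in-memory hash table built for a
# join on books: only the needed columns (id, category, price) are
# cached, not the whole row. All sizes here are assumptions.
rows = 125_000_000              # books rows in the 8 GB input class
bytes_per_entry = (
    8     # id: BIGINT
    + 20  # category: assumed average string length plus overhead
    + 4   # price: FLOAT
    + 16  # assumed per-entry hash-table overhead
)
hash_table_gb = rows * bytes_per_entry / 1024**3
print(f"~{hash_table_gb:.1f} GB build side for the books hash table")
```

Even under these modest assumptions the build side runs to several gigabytes, which is why the join strategy (how that hash table is distributed across nodes) matters so much in the next section.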

For example, our m1.xlarge cluster in part 1 of our experiment had only 60 GB of memory, but when we performed a single-table scan, we were able to process tables of 128 GB and above. Because Impala didn't need to cache the entire result set of the query, it streamed the result set back to the client. In contrast, when performing a join operation, Impala may quickly use up a cluster's memory even if the aggregated table size is smaller than the aggregated amount of memory. To make full use of the available resources, it is extremely important to optimize your queries. In this section, we take Q3 as an example to illustrate some of the optimization techniques you can try when an out-of-memory error happens.

Shown below is the typical error message you receive when the impalad process on a particular data node crashes due to a memory issue. To confirm the out-of-memory issue, you can log on to the data nodes and use the top command to monitor memory usage (%MEM). Note that even for the same query, the out-of-memory error does not always happen on the same node. No action is needed to recover from an out-of-memory error, because impalad is restarted automatically.

Backend 6:Couldn't open transport for ip-10-139-0-87.ec2.internal:22000(connect() failed: Connection refused)

Simple query optimization techniques can reduce a query's memory use, allowing you to sidestep an out-of-memory error. For example, the first version of Q3 (pre-optimization) is shown below, with the transactions table on the left side of the JOIN and the books table on the right:

SELECT
  tmp.book_category,
  ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT
    books.category AS book_category,
    SUM(books.price * transactions.quantity) AS revenue
  FROM transactions
  JOIN books ON (
    transactions.book_id = books.id
    AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC
LIMIT 10;

This query worked only for the 4 GB input class and failed for 8 GB and above due to an out-of-memory error. To understand why, you must consider how Impala executes queries. In preparation for the join, Impala builds a hash table from the books table that contains only the columns category, price, and id. Nothing from the transactions table is cached in memory. However, because Impala broadcasts the right-side table in this example, the books table is replicated to all the nodes that need it for the join. (In versions of Impala newer than 1.2.1, Impala makes a cost-based decision between broadcast and partitioned joins based on table statistics.) We simply swapped the two tables in the JOIN statement to get the second version of Q3 shown below:

SELECT
  tmp.book_category,
  ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT
    books.category AS book_category,
    SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN transactions ON (
    transactions.book_id = books.id
    AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC
LIMIT 10;

The second version of Q3 is more efficient, because only part of the transactions table is broadcast instead of the entire books table. Nonetheless, when we scaled up to the 32 GB input class, even the second version of Q3 began to fail due to memory constraints. To further optimize the query, we added a hint that forces Impala to use a partitioned join, which produces the final version of Q3 shown above in the Queries section. With all these optimizations, we eventually executed Q3 successfully for input classes up to 64 GB, making the final query 16 times more memory-efficient than the first version. There are many other ways to optimize queries for Impala, and we see these methods as a good way to get the best performance from your hardware and avoid out-of-memory errors.
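The intuition behind the partitioned-join hint can be sketched with back-of-the-envelope arithmetic. The sizes below are illustrative assumptions, not measurements from our experiment:

```python
# Per-node memory for the join's build side under the two strategies.
# A broadcast join copies the entire build-side hash table to every
# node; a partitioned ([SHUFFLE]) join hash-partitions both inputs so
# each node holds only its share. Sizes are illustrative assumptions.
nodes = 4
build_side_gb = 8.0  # assumed in-memory size of the build-side hash table

broadcast_per_node = build_side_gb            # full copy on every node
partitioned_per_node = build_side_gb / nodes  # one partition per node

print(f"broadcast:   {broadcast_per_node:.1f} GB per node")
print(f"partitioned: {partitioned_per_node:.1f} GB per node")
```

The partitioned join trades extra network shuffling for a per-node memory footprint that shrinks linearly with cluster size, which is why it let Q3 scale to much larger input classes than the broadcast versions.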