Using columnar formats for better query performance

Spark supports a variety of input file formats, such as Apache Parquet, Optimized Row Columnar (ORC), and CSV. However, Parquet works best with Spark SQL. It provides faster runtimes, higher scan throughput, reduced disk I/O, and lower operating costs. Spark can automatically skip irrelevant data by using pushdown filters that rely on Parquet file statistics, such as min-max values. In addition, you can enable the Spark Parquet vectorized reader to read Parquet files in batches. When you use Spark SQL to process data, we recommend that you use the Parquet file format if possible.
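The following PySpark sketch illustrates these settings. It enables Parquet filter pushdown and the vectorized reader (both are on by default in recent Spark versions), converts a CSV dataset to Parquet, and then queries the Parquet copy with a filter that can be pushed down to the scan. The S3 paths and column names are placeholders for illustration only.

```
from pyspark.sql import SparkSession

# Minimal sketch: confirm Parquet filter pushdown and the vectorized
# reader are enabled. Both settings default to true in recent Spark
# versions; they are shown here for clarity.
spark = (
    SparkSession.builder
    .appName("parquet-example")
    .config("spark.sql.parquet.filterPushdown", "true")          # use min-max statistics to skip row groups
    .config("spark.sql.parquet.enableVectorizedReader", "true")  # decode Parquet columns in batches
    .getOrCreate()
)

# Convert a CSV dataset to Parquet once, then query the Parquet copy.
# The bucket, prefixes, and column names below are hypothetical.
df = spark.read.option("header", "true").csv("s3://example-bucket/raw/events/")
df.write.mode("overwrite").parquet("s3://example-bucket/curated/events/")

# The filter on event_date can be pushed down to the Parquet scan, so
# row groups whose min-max statistics exclude the value are skipped.
events = spark.read.parquet("s3://example-bucket/curated/events/")
events.filter(events.event_date == "2023-01-01").show()
```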