Using columnar formats for better query performance

Spark supports a variety of input file formats, such as Apache Parquet, Optimized Row Columnar (ORC), and CSV. However, Parquet works best with Spark SQL. It provides faster runtimes, higher scan throughput, reduced disk I/O, and lower operating costs. Spark can automatically skip irrelevant data by using pushdown filters that rely on Parquet file statistics, such as min-max values. In addition, you can enable the Spark Parquet vectorized reader to read Parquet files in batches. When you use Spark SQL to process data, we recommend that you use the Parquet file format if possible.
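The following PySpark sketch illustrates these settings. It enables Parquet filter pushdown and the vectorized reader (both are on by default in recent Spark versions), converts a CSV dataset to Parquet, and then queries the Parquet copy with a filter that can be pushed down to the scan. The S3 paths and column names are placeholders for illustration only.

```
from pyspark.sql import SparkSession

# Minimal sketch: confirm Parquet filter pushdown and the vectorized
# reader are enabled. Both settings default to true in recent Spark
# versions; they are shown here for clarity.
spark = (
    SparkSession.builder
    .appName("parquet-example")
    .config("spark.sql.parquet.filterPushdown", "true")          # use min-max statistics to skip row groups
    .config("spark.sql.parquet.enableVectorizedReader", "true")  # decode Parquet columns in batches
    .getOrCreate()
)

# Convert a CSV dataset to Parquet once, then query the Parquet copy.
# The bucket, prefixes, and column names below are hypothetical.
df = spark.read.option("header", "true").csv("s3://example-bucket/raw/events/")
df.write.mode("overwrite").parquet("s3://example-bucket/curated/events/")

# The filter on event_date can be pushed down to the Parquet scan, so
# row groups whose min-max statistics exclude the value are skipped.
events = spark.read.parquet("s3://example-bucket/curated/events/")
events.filter(events.event_date == "2023-01-01").show()
```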