Use S3 Select with Spark to improve query performance
Important
Amazon S3 Select is no longer available to new customers. Existing customers of Amazon S3 Select can continue to use the feature as usual.
With Amazon EMR release 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object, which can reduce the amount of data transferred between Amazon EMR and Amazon S3.
S3 Select is supported with CSV and JSON files using the s3selectCSV and s3selectJSON values to specify the data format. For more information and examples, see Specify S3 Select in your code.
Is S3 Select right for my application?
We recommend that you benchmark your applications with and without S3 Select to see whether it is suitable for your application.
Use the following guidelines to determine if your application is a candidate for using S3 Select:
- Your query filters out more than half of the original data set.
- Your network connection between Amazon S3 and the Amazon EMR cluster has good transfer speed and available bandwidth. Amazon S3 does not compress HTTP responses, so the response size is likely to increase for compressed input files.
Considerations and limitations
- Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption are not supported.
- The AllowQuotedRecordDelimiters property is not supported. If this property is specified, the query fails.
- Only CSV and JSON files in UTF-8 format are supported. Multi-line CSVs are not supported.
- Only uncompressed or gzip files are supported.
- Spark CSV and JSON options such as nanValue, positiveInf, negativeInf, and options related to corrupt records (for example, failfast and dropmalformed mode) are not supported.
- Using commas (,) within decimals is not supported. For example, 10,000 is not supported and 10000 is.
- Comment characters in the last line are not supported.
- Empty lines at the end of a file are not processed.
- The following filters are not pushed down to Amazon S3:
  - Aggregate functions such as COUNT() and SUM().
  - Filters that CAST() an attribute. For example, CAST(stringColumn as INT) = 1.
  - Filters with an attribute that is an object or is complex. For example, intArray[1] = 1, objectColumn.objectNumber = 1.
  - Filters for which the value is not a literal value. For example, intColumn1 = intColumn2.
- Only data types that S3 Select supports are supported, with the documented limitations.
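A minimal sketch of the pushdown distinction, assuming a DataFrame df was read through the s3selectCSV data source (the column names here are placeholders):

```
# Assumes df was loaded with format("s3selectCSV"); column names are placeholders.
pushed = df.filter(df.age > 30)             # comparison against a literal can be pushed down to Amazon S3
not_pushed = df.filter(df.col1 == df.col2)  # column-to-column comparison is evaluated by Spark instead
```

In the second case the full column data is still transferred from Amazon S3 and the filter is applied by Spark, so no S3 Select savings apply.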
Specify S3 Select in your code
The following examples demonstrate how to specify S3 Select for CSV using Scala, SQL, R, and PySpark. You can use S3 Select for JSON in the same way. For a listing of options, their default values, and limitations, see Options.
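As a sketch of the basic PySpark read pattern (the path is a placeholder, and this runs only on an Amazon EMR cluster where the s3selectCSV data source is available):

```
# Sketch: read CSV data through S3 Select on an EMR cluster.
# Use "s3selectJSON" instead for JSON input. The path is a placeholder.
df = (
    spark.read
    .format("s3selectCSV")
    .schema(...)    # optional, but recommended
    .options(...)   # optional; see Options
    .load("s3://path/to/my/datafiles")
)
```

The same pattern applies in Scala, SQL, and R through each language's data source API.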
Options
The following options are available when using s3selectCSV and s3selectJSON. If not specified, default values are used.
Options with s3selectCSV
Option | Default | Usage |
---|---|---|
compression | "none" | Indicates whether compression is used. "gzip" is the only compression setting supported other than "none". |
delimiter | "," | Specifies the field delimiter. |
quote | '"' | Specifies the quote character. Specifying an empty string is not supported and results in a malformed XML error. |
escape | '"' | Specifies the escape character. |
header | "false" | "false" specifies that there is no header. "true" specifies that a header is in the first line. Only headers in the first line are supported, and empty lines before a header are not supported. |
comment | "#" | Specifies the comment character. The comment indicator cannot be disabled. In other words, a value of \u0000 is not supported. |
nullValue | "" | Specifies the string that represents a null value in the input data. |
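As a sketch, these options can be supplied through the reader (the path is a placeholder, and this assumes an EMR cluster with S3 Select available):

```
# Placeholder path; reads gzip-compressed, tab-delimited CSV files with a header row.
df = (
    spark.read
    .format("s3selectCSV")
    .option("compression", "gzip")
    .option("delimiter", "\t")
    .option("header", "true")
    .load("s3://path/to/my/datafiles")
)
```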
Options with s3selectJSON
Option | Default | Usage |
---|---|---|
compression | "none" | Indicates whether compression is used. "gzip" is the only compression setting supported other than "none". |
multiline | "false" | "false" specifies that the JSON records are in S3 Select LINES format, meaning that each line in the input data contains a single JSON object. "true" specifies that the JSON records are in S3 Select DOCUMENT format, meaning that a JSON object can span multiple lines in the input data. |
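Similarly, a minimal sketch for JSON input (the path is a placeholder), reading multi-line JSON documents by setting multiline:

```
# Placeholder path; "true" reads S3 Select DOCUMENT format,
# where a JSON object can span multiple lines.
df = (
    spark.read
    .format("s3selectJSON")
    .option("multiline", "true")
    .load("s3://path/to/my/datafiles")
)
```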