Supported data types
The Spark connector supports the following Amazon Redshift data types. If a data type is not in the table below, it isn't supported by the Spark connector. For a complete list of data types in Amazon Redshift, see Data types.
| Data type | Aliases |
| --- | --- |
| SMALLINT | INT2 |
| INTEGER | INT, INT4 |
| BIGINT | INT8 |
| DECIMAL | NUMERIC |
| REAL | FLOAT4 |
| DOUBLE PRECISION | FLOAT8, FLOAT |
| BOOLEAN | BOOL |
| CHAR | CHARACTER, NCHAR, BPCHAR |
| VARCHAR | CHARACTER VARYING, NVARCHAR, TEXT |
| DATE | |
| TIMESTAMP | TIMESTAMP WITHOUT TIME ZONE |
| TIMESTAMPTZ | TIMESTAMP WITH TIME ZONE |
| SUPER | |
| TIME | TIME WITHOUT TIME ZONE |
| TIMETZ | TIME WITH TIME ZONE |
| VARBYTE | VARBINARY, BINARY VARYING |
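To see how these types surface in Spark, you can read a table and inspect the schema that the connector infers. The following is a minimal sketch; the table name event_staging is a placeholder, and jdbcURL and tempS3Dir stand for your own JDBC URL and S3 staging directory, as in the examples later in this section.

import org.apache.spark.sql._

// Reuse the SQLContext from an existing SparkSession.
val sqlContext = SparkSession.builder().getOrCreate().sqlContext

// Read a Redshift table and let the connector map the Redshift column types
// in the table above to their Spark equivalents.
val typesDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)              // placeholder JDBC URL
  .option("tempdir", tempS3Dir)        // placeholder S3 staging directory
  .option("dbtable", "event_staging")  // placeholder table name
  .load()

// Prints the Spark type the connector chose for each Redshift column.
typesDF.printSchema()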
Complex data types
You can use the Spark connector to read and write Spark complex data types such as ArrayType, MapType, and StructType to and from Redshift SUPER data type columns. If you provide a schema during a read operation, the data in the column is converted to its corresponding complex types in Spark, including any nested types. Additionally, if autopushdown is enabled, projection of nested attributes, map values, and array indices is pushed down to Redshift, so the entire nested data structure no longer needs to be unloaded when you access only a portion of the data.
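For example, when autopushdown is enabled, selecting only a nested attribute lets the connector push that projection down to Redshift. The following is a minimal sketch that reuses the contains_super table created in the StructType example below; setting the autopushdown option explicitly is an assumption about your connector configuration, so verify the option for your connector version.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val sqlContext = SparkSession.builder().getOrCreate().sqlContext

// SUPER column "a" holds a struct with a "hello" field.
val schema = StructType(StructField("a", StructType(StructField("hello", StringType) :: Nil)) :: Nil)

val pushdownDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .option("autopushdown", "true")  // assumed option name; verify for your connector version
  .schema(schema)
  .load()

// Because only a.hello is selected, the projection of the nested attribute can be
// pushed down to Redshift instead of unloading the entire SUPER structure.
pushdownDF.selectExpr("a.hello").show()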
When you write DataFrames from the connector, any column of type MapType (using StringType keys), StructType, or ArrayType is written to a Redshift SUPER data type column. When writing these nested data structures, the tempformat parameter must be set to CSV, CSV GZIP, or PARQUET. Using AVRO causes an exception. Writing a MapType data structure that has a key type other than StringType also causes an exception.
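As a sketch of the tempformat requirement, the following writes a DataFrame with an ArrayType column to a SUPER column using PARQUET as the tempformat. The table name array_super_table is a placeholder for a table with a single SUPER column a; CSV or CSV GZIP would work equally well here, while AVRO would cause an exception.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// One column "a" holding an array of integers, bound for a SUPER column.
val schema = StructType(StructField("a", ArrayType(IntegerType)) :: Nil)
val data = spark.sparkContext.parallelize(Seq(Row(Seq(1, 2, 3))))
val arrayDF = sqlContext.createDataFrame(data, schema)

arrayDF.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "array_super_table")  // placeholder table with a SUPER column "a"
  .option("tempdir", tempS3Dir)
  .option("tempformat", "PARQUET")         // CSV and CSV GZIP are also allowed; AVRO is not
  .mode(SaveMode.Append)
  .save()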
StructType
The following example demonstrates how to create a table with a SUPER data type column that contains a struct.
create table contains_super (a super);
You can then use the connector to query a StringType field hello from the SUPER column a in the table, using a schema as in the following example.
import org.apache.spark.sql.types._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", StructType(StructField("hello", StringType) :: Nil)) :: Nil)

val helloDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .schema(schema)
  .load()
  .selectExpr("a.hello")
The following example demonstrates how to write a struct to the column a.
import org.apache.spark.sql.types._
import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", StructType(StructField("hello", StringType) :: Nil)) :: Nil)
val data = sc.parallelize(Seq(Row(Row("world"))))
val mydf = sqlContext.createDataFrame(data, schema)

mydf.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", tableName)
  .option("tempdir", tempS3Dir)
  .option("tempformat", "CSV")
  .mode(SaveMode.Append)
  .save()
MapType
If you prefer to use a MapType to represent your data, you can use a MapType data structure in your schema and retrieve the value corresponding to a key in the map. Note that all keys in your MapType data structure must be of type String, and all of the values must be of the same type, such as int.
The following example demonstrates how to get the value of the key hello in the column a.
import org.apache.spark.sql.types._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", MapType(StringType, IntegerType)) :: Nil)

val helloDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .schema(schema)
  .load()
  .selectExpr("a['hello']")
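Writing a MapType works the same way. The following is a minimal sketch that writes a MapType(StringType, IntegerType) column to a SUPER column; the keys must be StringType, and tableName, jdbcUrl, and tempS3Dir are the same placeholders used in the write example above.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// Map keys must be StringType; all values here share the same type (IntegerType).
val schema = StructType(StructField("a", MapType(StringType, IntegerType)) :: Nil)
val data = spark.sparkContext.parallelize(Seq(Row(Map("hello" -> 1))))
val mapDF = sqlContext.createDataFrame(data, schema)

mapDF.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", tableName)    // table with a SUPER column "a"
  .option("tempdir", tempS3Dir)
  .option("tempformat", "CSV")     // CSV GZIP or PARQUET also work; AVRO does not
  .mode(SaveMode.Append)
  .save()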
ArrayType
If the column contains an array instead of a struct, you can use the connector to query the first element in the array.
import org.apache.spark.sql.types._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", ArrayType(IntegerType)) :: Nil)

val helloDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .schema(schema)
  .load()
  .selectExpr("a[0]")
Limitations
Using complex data types with the Spark connector has the following limitations:
- All nested struct field names and map keys must be lowercase. If you are querying for complex field names with uppercase letters, you can try omitting the schema and using the from_json Spark function to convert the returned string locally as a workaround (see the sketch after this list).
- Any map fields used in read or write operations must have only StringType keys.
- Only CSV, CSV GZIP, and PARQUET are supported tempformat values for writing complex types to Redshift. Attempting to use AVRO will throw an exception.
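The following is a minimal sketch of the from_json workaround for uppercase field names. Omitting the schema makes the connector return the SUPER column a as a string, which is then parsed locally; the uppercase field name Hello is a hypothetical example.

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, from_json}

val sqlContext = SparkSession.builder().getOrCreate().sqlContext

// Read without a schema so the SUPER column "a" comes back as a string.
val rawDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .load()

// Parse the returned string locally with a schema that uses the uppercase field name.
val localSchema = StructType(StructField("Hello", StringType) :: Nil)
val parsedDF = rawDF
  .select(from_json(col("a"), localSchema).alias("a"))
  .selectExpr("a.Hello")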