Creating an external table in Hive
In Tutorial: Working with Amazon DynamoDB and Apache Hive, you created an external Hive table that mapped to a DynamoDB table. When you issued HiveQL statements against the external table, the read and write operations were passed through to the DynamoDB table.
You can think of an external table as a pointer to a data source that is managed and
stored elsewhere. In this case, the underlying data source is a DynamoDB table. (The table
must already exist. You cannot create, update, or delete a DynamoDB table from within
Hive.) You use the CREATE EXTERNAL TABLE
statement to create the external
table. After that, you can use HiveQL to work with data in DynamoDB, as if that data were
stored locally within Hive.
Note
You can use INSERT
statements to insert data into an external table
and SELECT
statements to select data from it. However, you cannot use
UPDATE
or DELETE
statements to manipulate data in the
table.
If you no longer need the external table, you can remove it using the DROP
TABLE
statement. In this case, DROP TABLE
only removes the
external table in Hive. It does not affect the underlying DynamoDB table or any of its
data.
CREATE EXTERNAL TABLE syntax
The following shows the HiveQL syntax for creating an external Hive table that maps to a DynamoDB table:
CREATE EXTERNAL TABLE
hive_table
(hive_column1_name
hive_column1_datatype
,hive_column2_name
hive_column2_datatype
...) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ( "dynamodb.table.name" ="dynamodb_table"
, "dynamodb.column.mapping" = "hive_column1_name
:dynamodb_attribute1_name
,hive_column2_name
:dynamodb_attribute2_name
..." );
Line 1 is the start of the CREATE EXTERNAL TABLE
statement, where you
provide the name of the Hive table (hive_table) you want to
create.
Line 2 specifies the columns and data types for hive_table. You need to define columns and data types that correspond to the attributes in the DynamoDB table.
Line 3 is the STORED BY
clause, where you specify a class that
handles data management between the Hive and the DynamoDB table. For DynamoDB,
STORED BY
should be set to
'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
.
Line 4 is the start of the TBLPROPERTIES
clause, where you define the
following parameters for DynamoDBStorageHandler
:
-
dynamodb.table.name
—the name of the DynamoDB table. -
dynamodb.column.mapping
—pairs of column names in the Hive table and their corresponding attributes in the DynamoDB table. Each pair is of the form hive_column_name:dynamodb_attribute_name, and the pairs are separated by commas.
Note the following:
-
The name of the Hive table name does not have to be the same as the DynamoDB table name.
-
The Hive table column names do not have to be the same as those in the DynamoDB table.
-
The table specified by
dynamodb.table.name
must exist in DynamoDB. -
For
dynamodb.column.mapping
:-
You must map the key schema attributes for the DynamoDB table. This includes the partition key and the sort key (if present).
-
You do not have to map the non-key attributes of the DynamoDB table. However, you will not see any data from those attributes when you query the Hive table.
-
If the data types of a Hive table column and a DynamoDB attribute are incompatible, you will see
NULL
in these columns when you query the Hive table.
-
Note
The CREATE EXTERNAL TABLE
statement does not perform any
validation on the TBLPROPERTIES
clause. The values you provide for
dynamodb.table.name
and dynamodb.column.mapping
are only evaluated by the DynamoDBStorageHandler
class when you
attempt to access the table.
Data type mappings
The following table shows DynamoDB data types and compatible Hive data types:
DynamoDB Data Type | Hive Data Type |
---|---|
String |
|
Number |
|
Binary |
|
String Set |
|
Number Set |
|
Binary Set |
|
Note
The following DynamoDB data types are not supported by the
DynamoDBStorageHandler
class, so they cannot be used with
dynamodb.column.mapping
:
-
Map
-
List
-
Boolean
-
Null
However, if you need to work with these data types, you can create a single
entity called item
that represents the entire DynamoDB item as a map
of strings for both keys and values in the map. For more information, see Copying data
without a column mapping
If you want to map a DynamoDB attribute of type Number, you must choose an appropriate Hive type:
-
The Hive
BIGINT
type is for 8-byte signed integers. It is the same as thelong
data type in Java. -
The Hive
DOUBLE
type is for 8-bit double precision floating point numbers. It is the same as thedouble
type in Java.
If you have numeric data stored in DynamoDB that has a higher precision than the Hive data type you choose, then accessing the DynamoDB data could cause a loss of precision.
If you export data of type Binary from DynamoDB to (Amazon S3) or HDFS, the data is stored as a Base64-encoded string. If you import data from Amazon S3 or HDFS into the DynamoDB Binary type, you must ensure the data is encoded as a Base64 string.