MSCK REPAIR TABLE
Use the MSCK REPAIR TABLE
command to update the metadata in the catalog after
you add Hive compatible partitions.
The MSCK REPAIR TABLE
command scans a file system such as Amazon S3 for Hive
compatible partitions that were added to the file system after the table was created.
MSCK REPAIR TABLE
compares the partitions in the table metadata and the
partitions in S3. If new partitions are present in the S3 location that you specified when
you created the table, it adds those partitions to the metadata and to the Athena
table.
When you add physical partitions, the metadata in the catalog becomes inconsistent with
the layout of the data in the file system, and information about the new partitions needs to
be added to the catalog. To update the metadata, run MSCK REPAIR TABLE
so that
you can query the data in the new partitions from Athena.
Note
MSCK REPAIR TABLE
only adds partitions to metadata; it does not remove
them. To remove partitions from metadata after the partitions have been manually deleted
in Amazon S3, run the command ALTER TABLE
. For more information see ALTER TABLE DROP
PARTITION.
table-name
DROP
PARTITION
Considerations and limitations
When using MSCK REPAIR TABLE
, keep in mind the following points:
-
It is possible it will take some time to add all partitions. If this operation times out, it will be in an incomplete state where only a few partitions are added to the catalog. You should run
MSCK REPAIR TABLE
on the same table until all partitions are added. For more information, see Partition your data. -
For partitions that are not compatible with Hive, use ALTER TABLE ADD PARTITION to load the partitions so that you can query their data.
-
Partition locations to be used with Athena must use the
s3
protocol (for example,s3://amzn-s3-demo-bucket/
). In Athena, locations that use other protocols (for example,folder
/s3a://
) will result in query failures whenbucket
/folder
/MSCK REPAIR TABLE
queries are run on the containing tables. -
Because
MSCK REPAIR TABLE
scans both a folder and its subfolders to find a matching partition scheme, be sure to keep data for separate tables in separate folder hierarchies. For example, suppose you have data for table 1 ins3://amzn-s3-demo-bucket1
and data for table 2 ins3://amzn-s3-demo-bucket1/table-2-data
. If both tables are partitioned by string,MSCK REPAIR TABLE
will add the partitions for table 2 to table 1. To avoid this, use separate folder structures likes3://amzn-s3-demo-bucket1
ands3://amzn-s3-demo-bucket2
instead. Note that this behavior is consistent with Amazon EMR and Apache Hive. -
Due to a known issue,
MSCK REPAIR TABLE
fails silently when partition values contain a colon (:
) character (for example, when the partition value is a timestamp). As a workaround, use ALTER TABLE ADD PARTITION. -
MSCK REPAIR TABLE
does not add partition column names that begin with an underscore (_). To work around this limitation, use ALTER TABLE ADD PARTITION.
Synopsis
MSCK REPAIR TABLE table_name
Examples
MSCK REPAIR TABLE orders;
Troubleshooting
After you run MSCK REPAIR TABLE
, if Athena does not add the partitions to
the table in the AWS Glue Data Catalog, check the following:
-
AWS Glue access – Make sure that the AWS Identity and Access Management (IAM) role has a policy that allows the
glue:BatchCreatePartition
action. For more information, see Allow glue:BatchCreatePartition in the IAM policy later in this document. -
Amazon S3 access – Make sure that the role has a policy with sufficient permissions to access Amazon S3, including the
s3:DescribeJob
action. For an example of which Amazon S3 actions to allow, see the example bucket policy in Cross-account access in Athena to Amazon S3 buckets. -
Amazon S3 object key casing – Make sure that the Amazon S3 path is in lower case instead of camel case (for example,
userid
instead ofuserId
), or useALTER TABLE ADD PARTITION
to specify the object key names. For more information, see Change or redefine the Amazon S3 path later in this document. -
Query timeouts –
MSCK REPAIR TABLE
is best used when creating a table for the first time or when there is uncertainty about parity between data and partition metadata. If you useMSCK REPAIR TABLE
to add new partitions frequently (for example, on a daily basis) and are experiencing query timeouts, consider using ALTER TABLE ADD PARTITION. -
Partitions missing from file system – If you delete a partition manually in Amazon S3 and then run
MSCK REPAIR TABLE
, you may receive the error messagePartitions missing from filesystem
. This occurs becauseMSCK REPAIR TABLE
doesn't remove stale partitions from table metadata. To remove the deleted partitions from table metadata, run ALTER TABLE DROP PARTITION instead. Note that SHOW PARTITIONS similarly lists only the partitions in metadata, not the partitions in the file system. -
"NullPointerException name is null" error
If you use the AWS Glue CreateTable API operation or the AWS CloudFormation
AWS::Glue::Table
template to create a table for use in Athena without specifying theTableType
property and then run a DDL query likeSHOW CREATE TABLE
orMSCK REPAIR TABLE
, you can receive the error messageFAILED: NullPointerException Name is null
.To resolve the error, specify a value for the TableInput
TableType
attribute as part of the AWS GlueCreateTable
API call or AWS CloudFormation template. Possible values forTableType
includeEXTERNAL_TABLE
orVIRTUAL_VIEW
.This requirement applies only when you create a table using the AWS Glue
CreateTable
API operation or theAWS::Glue::Table
template. If you create a table for Athena by using a DDL statement or an AWS Glue crawler, theTableType
property is defined for you automatically.
The following sections provide some additional detail.
Allow glue:BatchCreatePartition in the IAM policy
Review the IAM policies attached to the role that you're using to run MSCK
REPAIR TABLE
. When you use the AWS Glue Data Catalog with Athena, the IAM
policy must allow the glue:BatchCreatePartition
action. For an example
of an IAM policy that allows the glue:BatchCreatePartition
action,
see AWS managed policy:
AmazonAthenaFullAccess.
Change or redefine the Amazon S3 path
If one or more object keys in the Amazon S3 path are in camel case instead of lower
case, MSCK REPAIR TABLE
might not add the partitions to the
AWS Glue Data Catalog. For example, if your Amazon S3 path includes the object key name
userId
, the following partitions might not be added to the
AWS Glue Data Catalog:
s3://amzn-s3-demo-bucket/path/userId=1/ s3://amzn-s3-demo-bucket/path/userId=2/ s3://amzn-s3-demo-bucket/path/userId=3/
To resolve this issue, do one of the following:
-
Use lower case instead of camel case when you create your Amazon S3 object keys:
s3://amzn-s3-demo-bucket/path/userid=1/ s3://amzn-s3-demo-bucket/path/userid=2/ s3://amzn-s3-demo-bucket/path/userid=3/
-
Use ALTER TABLE ADD PARTITION to redefine the location, as in the following example:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION (userId=1) LOCATION 's3://amzn-s3-demo-bucket/path/userId=1/' PARTITION (userId=2) LOCATION 's3://amzn-s3-demo-bucket/path/userId=2/' PARTITION (userId=3) LOCATION 's3://amzn-s3-demo-bucket/path/userId=3/'
Note that although Amazon S3 object key names can use upper case, Amazon S3 bucket names themselves must always be in lower case. For more information, see Object key naming guidelines and Bucket naming rules in the Amazon S3 User Guide.