Spark 1.6: The table metadata problem
Problem Description
In Jan 2018, this is what I was trying to do with Spark 1.6:
- There is an external Hive ORC table, partitioned on region code and country code.
- I read the whole table into a Spark DataFrame.
- I took one partition, say regionCode=01 and countryCode=01, and changed its values so that it becomes regionCode=02 and countryCode=02.
- I wrote the changed records to a temporary file. "Changed" means all records that used to have regionCode=01 and countryCode=01 now have regionCode=02 and countryCode=02. The write finished successfully and was verified afterwards.
- Then I deleted the old partition regionCode=01 and countryCode=01. By "delete", I mean calling Hive ALTER TABLE DROP PARTITION(...) and deleting that partition's files and folders in HDFS.
- Finally, I tried to read the temporary file and write it back to the Hive table using dataFrame.write(). The whole flow is sketched below.
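To make the steps concrete, here is a minimal Scala sketch of the flow, assuming a HiveContext and a hypothetical table mydb.events (the table name, temp path, and write options are assumptions, not from the original setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.lit

val sc = new SparkContext(new SparkConf().setAppName("partition-rewrite"))
val sqlContext = new HiveContext(sc)

// 1. Read the whole external ORC table (table name is hypothetical).
val df = sqlContext.table("mydb.events")

// 2. Take the partition regionCode=01 / countryCode=01 and change its codes to 02/02.
val moved = df
  .filter(df("regionCode") === "01" && df("countryCode") === "01")
  .withColumn("regionCode", lit("02"))
  .withColumn("countryCode", lit("02"))

// 3. Write the changed records to a temporary location and verify them.
moved.write.orc("/tmp/events_moved")

// 4. Drop the old partition from the Hive metastore
//    (its files and folders are also removed from HDFS, e.g. with hadoop fs -rm -r).
sqlContext.sql(
  "ALTER TABLE mydb.events DROP PARTITION (regionCode='01', countryCode='01')")

// 5. Read the temp file back and append it to the table
//    (dynamic-partition settings omitted) -- this is the write that fails.
sqlContext.read.orc("/tmp/events_moved")
  .write
  .mode("append")
  .insertInto("mydb.events")
```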
The write-back failed, complaining about the old partition's files. Yes, because I removed them from HDFS. But I also removed the partition information from the metadata by running ALTER TABLE DROP PARTITION(). I tried SHOW PARTITIONS in Hive, and that partition is gone.
Fine. The reason is that the table metadata was not refreshed at the time of the write-back, so Spark still looks for the old partition. I need to refresh the metadata.
dataFrame.refreshTable() does not work: the metadata is not refreshed and I get the same error.
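For reference, this is roughly what the refresh attempt looks like; note that in Spark 1.6 the refresh call is exposed on the HiveContext rather than on the DataFrame itself (same hypothetical table as above):

```scala
// Attempted fix: ask Spark to reload its cached metadata for the table.
// In Spark 1.6 this lives on the HiveContext, not on the DataFrame.
sqlContext.refreshTable("mydb.events")

// The write-back below still fails with the same complaint about the dropped partition.
sqlContext.read.orc("/tmp/events_moved")
  .write
  .mode("append")
  .insertInto("mydb.events")
```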
Workaround
Try 1: Use dataFrame.clearCache(). This works for this case.
The metadata got refreshed and Spark no longer looks for the old partition. However, all cached data got cleared as well. A sketch follows.
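A minimal sketch of that workaround, assuming the same HiveContext and hypothetical table as above (in Spark 1.6, clearCache() is exposed on the SQLContext/HiveContext):

```scala
// Workaround: drop everything Spark has cached, including the table metadata
// that still referenced the dropped partition. The cost is that all cached
// DataFrames/tables are uncached as well.
sqlContext.clearCache()

// After clearing the cache, the same append no longer looks for
// regionCode=01 / countryCode=01.
sqlContext.read.orc("/tmp/events_moved")
  .write
  .mode("append")
  .insertInto("mydb.events")
```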
Try 2: Modify HiveMap (not yet tried)