Spark 1.6: The table metadata problem

 Problem Description

In January 2018, this is what I was trying to do with Spark 1.6:

  1. There is an external Hive ORC table, partitioned on region code and country code.
  2. I read the whole table into a Spark DataFrame.
  3. I took one partition, let's say regionCode=01 and countryCode=01, and changed the values so the records belong to regionCode=02 and countryCode=02.
  4. I wrote the changed records to a temp file. "Changed records" means all records which used to have regionCode=01 and countryCode=01 and now have regionCode=02 and countryCode=02. The write finished successfully and was verified after writing.
  5. Then I deleted the old partition regionCode=01 and countryCode=01. By "delete", I mean I ran Hive ALTER TABLE ... DROP PARTITION(...) and deleted the files and folders of that partition in HDFS.
  6. Now I tried to read the temp file and write it back to the Hive table using dataFrame.write() (the whole sequence is sketched below).
Error! It says the regionCode=01 and countryCode=01 folder was not found.
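
For reference, the sequence looks roughly like the sketch below in Spark 1.6 Scala. The table name my_db.my_orc_table and the temp path /tmp/region_relabelled are placeholders, and the exact write call (saveAsTable vs. insertInto) depends on how the external table was created, so treat this as an outline rather than the exact job I ran.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.hive.HiveContext

object RelabelPartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("relabel-partition"))
    val hiveContext = new HiveContext(sc)

    // 1-2. Read the whole external Hive ORC table into a DataFrame.
    val df = hiveContext.table("my_db.my_orc_table")

    // 3. Take the partition regionCode=01 / countryCode=01 and relabel it as 02 / 02.
    val relabelled = df
      .filter("regionCode = '01' AND countryCode = '01'")
      .withColumn("regionCode", lit("02"))
      .withColumn("countryCode", lit("02"))

    // 4. Write the relabelled records to a temp location and verify them.
    relabelled.write.mode(SaveMode.Overwrite).format("orc").save("/tmp/region_relabelled")

    // 5. Drop the old partition from the metastore; the HDFS folder of that
    //    partition is removed separately (e.g. with hdfs dfs -rm -r).
    hiveContext.sql(
      "ALTER TABLE my_db.my_orc_table DROP PARTITION (regionCode='01', countryCode='01')")

    // 6. Read the temp file back and write it into the Hive table. This is the
    //    step that fails with "regionCode=01/countryCode=01 folder not found".
    val back = hiveContext.read.format("orc").load("/tmp/region_relabelled")
    back.write.mode(SaveMode.Append)
      .partitionBy("regionCode", "countryCode")
      .saveAsTable("my_db.my_orc_table")
  }
}
```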


Yes, because I removed it from HDFS. But I had also removed the partition information from the metadata by running ALTER TABLE DROP PARTITION(). I tried SHOW PARTITIONS in Hive, and that partition is gone.
Fine. The reason is that the metadata was not refreshed by the time I wrote back, so Spark still looks for the old partition. I have to refresh the metadata.

dataFrame.refreshTable() does not work. The metadata is not refreshed and I get the same error.
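
For the record, the refresh attempt amounts to something like this, reusing the placeholder names from the sketch above (note that in Spark 1.6 the refresh call actually lives on the HiveContext rather than on the DataFrame):

```scala
// Attempt: refresh the table's cached metadata before writing back.
hiveContext.refreshTable("my_db.my_orc_table")

// Write back from the temp location -- same error, so the stale
// partition list apparently survives the refresh.
hiveContext.read.format("orc").load("/tmp/region_relabelled")
  .write.mode(SaveMode.Append)
  .partitionBy("regionCode", "countryCode")
  .saveAsTable("my_db.my_orc_table")
```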

Workaround

Try 1: Use dataFrame.clearCache(). This works for this case.
The metadata gets refreshed and Spark no longer looks for the old partition. However, all cached data gets cleared too.
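
A sketch of that workaround, again with the placeholder names from above (in Spark 1.6 the clearCache call is on the SQLContext/HiveContext):

```scala
// Try 1: clear the whole in-memory cache. After this, the write-back no
// longer looks for the dropped regionCode=01/countryCode=01 partition,
// but every other cached table/DataFrame is evicted as well.
hiveContext.clearCache()

hiveContext.read.format("orc").load("/tmp/region_relabelled")
  .write.mode(SaveMode.Append)
  .partitionBy("regionCode", "countryCode")
  .saveAsTable("my_db.my_orc_table")
```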

Try 2: Modify HiveMap (not yet tried)
