Spark 1.6: The table metadata problem

 Problem Description

In January 2018, this is what I was trying to do with Spark 1.6:

  1. There is an external Hive ORC table, partitioned on region code and country code.
  2. I read the whole table into a Spark DataFrame.
  3. I took one partition, let's say regionCode=01 and countryCode=01, and changed the values so the records belong to regionCode=02 and countryCode=02.
  4. I wrote the changed records to a temp file. "Changed records" means all records which used to have regionCode=01 and countryCode=01 and now have regionCode=02 and countryCode=02. The write finished successfully and was verified after writing.
  5. Then I deleted the old partition regionCode=01 and countryCode=01. By "delete", I mean I ran Hive ALTER TABLE ... DROP PARTITION(...) and deleted the files and folders of that partition in HDFS.
  6. Now I tried to read the temp file and write it back to the Hive table using dataFrame.write() (the whole sequence is sketched below).
Error! It says the regionCode=01 and countryCode=01 folder was not found.
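
For reference, the sequence looks roughly like the sketch below in Spark 1.6 Scala. The table name my_db.my_orc_table and the temp path /tmp/region_relabelled are placeholders, and the exact write call (saveAsTable vs. insertInto) depends on how the external table was created, so treat this as an outline rather than the exact job I ran.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.hive.HiveContext

object RelabelPartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("relabel-partition"))
    val hiveContext = new HiveContext(sc)

    // 1-2. Read the whole external Hive ORC table into a DataFrame.
    val df = hiveContext.table("my_db.my_orc_table")

    // 3. Take the partition regionCode=01 / countryCode=01 and relabel it as 02 / 02.
    val relabelled = df
      .filter("regionCode = '01' AND countryCode = '01'")
      .withColumn("regionCode", lit("02"))
      .withColumn("countryCode", lit("02"))

    // 4. Write the relabelled records to a temp location and verify them.
    relabelled.write.mode(SaveMode.Overwrite).format("orc").save("/tmp/region_relabelled")

    // 5. Drop the old partition from the metastore; the HDFS folder of that
    //    partition is removed separately (e.g. with hdfs dfs -rm -r).
    hiveContext.sql(
      "ALTER TABLE my_db.my_orc_table DROP PARTITION (regionCode='01', countryCode='01')")

    // 6. Read the temp file back and write it into the Hive table. This is the
    //    step that fails with "regionCode=01/countryCode=01 folder not found".
    val back = hiveContext.read.format("orc").load("/tmp/region_relabelled")
    back.write.mode(SaveMode.Append)
      .partitionBy("regionCode", "countryCode")
      .saveAsTable("my_db.my_orc_table")
  }
}
```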


Yes, because I removed it from HDFS. But I had also removed the partition information from the metadata by running ALTER TABLE DROP PARTITION(). I tried SHOW PARTITIONS in Hive, and that partition is gone.
Fine. The reason is that the metadata was not refreshed by the time I wrote back, so Spark still looks for the old partition. I have to refresh the metadata.

dataFrame.refreshTable() does not work. The metadata is not refreshed and I get the same error.
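
For the record, the refresh attempt amounts to something like this, reusing the placeholder names from the sketch above (note that in Spark 1.6 the refresh call actually lives on the HiveContext rather than on the DataFrame):

```scala
// Attempt: refresh the table's cached metadata before writing back.
hiveContext.refreshTable("my_db.my_orc_table")

// Write back from the temp location -- same error, so the stale
// partition list apparently survives the refresh.
hiveContext.read.format("orc").load("/tmp/region_relabelled")
  .write.mode(SaveMode.Append)
  .partitionBy("regionCode", "countryCode")
  .saveAsTable("my_db.my_orc_table")
```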

Workaround

Try 1: Use dataFrame.clearCache(). This works for this case.
The metadata gets refreshed and Spark no longer looks for the old partition. However, all cached data gets cleared too.
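
A sketch of that workaround, again with the placeholder names from above (in Spark 1.6 the clearCache call is on the SQLContext/HiveContext):

```scala
// Try 1: clear the whole in-memory cache. After this, the write-back no
// longer looks for the dropped regionCode=01/countryCode=01 partition,
// but every other cached table/DataFrame is evicted as well.
hiveContext.clearCache()

hiveContext.read.format("orc").load("/tmp/region_relabelled")
  .write.mode(SaveMode.Append)
  .partitionBy("regionCode", "countryCode")
  .saveAsTable("my_db.my_orc_table")
```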

Try 2: Modify HiveMap (not yet tried)
