Posts

Spark 2.2 Count Table Multiple Time Issue

Problem: In Spark 2.2, counting a table multiple times returns the same value regardless of the table's actual state.

Description: There is a table A that initially has 100 records. In a loop:
(1) int countBefore = count of table A via Spark
(2) // remove 2 records from table A
(3) int countAfter = count of table A via Spark
Note that (1) was tried with both session.sql("select count(*) ...") and session.sql("select * ...").count().

Behavior:
1st iteration: countBefore = 100, countAfter = 98
2nd iteration: countBefore = 100 (expected 98), countAfter = 96

Explanation and Solution: It looks like the count is cached for optimization, even though the variable countBefore is declared inside the loop iteration, so variable scope does not help. The solution is to call session.catalog.clearCache() at the beginning of every iteration. The other attempt, session.catalog.refreshTable(tableA), does not solve the problem.
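
A minimal Java sketch of the workaround described above. The table name table_a and the way the records are removed are assumptions for illustration; the key point is the clearCache() call at the top of each iteration.

```java
import org.apache.spark.sql.SparkSession;

public class StaleCountWorkaround {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("stale-count-workaround")
                .enableHiveSupport()
                .getOrCreate();

        for (int i = 0; i < 3; i++) {
            // Workaround: drop cached plans/data so the next count
            // reflects the table's current state.
            session.catalog().clearCache();

            long countBefore = session.sql("SELECT * FROM table_a").count();

            // ... remove 2 records from table_a here (e.g. via an external job) ...

            long countAfter = session.sql("SELECT * FROM table_a").count();
            System.out.println("before=" + countBefore + ", after=" + countAfter);
        }

        session.stop();
    }
}
```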

Java Concurrency: Final

What does "final" mean? (The definition below only applies to JDK 6+.)
1) The address of the enclosing object is not allowed to escape until the final variable is initialized and the change has been written to memory.
2) The JVM can execute special CPU caching instructions.
3) All changes made in the constructor to a final variable, even to the object it references, are pushed to memory.
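
A small sketch of point 3, modeled on the well-known final-field example from the Java Language Specification (§17.5); class and field names here are illustrative, not from the original post.

```java
// Writes made before the constructor finishes are "frozen" with the final field.
public class FinalFieldExample {
    final int x;   // final: guaranteed visible once the object is seen
    int y;         // non-final: no such guarantee
    static FinalFieldExample f;

    public FinalFieldExample() {
        x = 3;
        y = 4;
    }

    static void writer() {
        f = new FinalFieldExample();   // publish without synchronization
    }

    static void reader() {
        if (f != null) {
            // Any thread that sees a non-null f is guaranteed to see x == 3,
            // but it may still observe y == 0.
            System.out.println("x=" + f.x + " y=" + f.y);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread w = new Thread(FinalFieldExample::writer);
        Thread r = new Thread(FinalFieldExample::reader);
        w.start();
        r.start();
        w.join();
        r.join();
    }
}
```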

Java Synchronization: volatile

[Image: a multi-core system where each CPU has its own local cache]
What does "volatile" mean? (Under the new JMM, Java 1.5+.)
1) A volatile variable is read from and written to main memory instead of a local cache. The picture above shows an example of a multi-core system: each CPU has a cache, which is super-fast memory local to that CPU. For optimization, a lot of information is stored in the cache. But in a multi-threaded application that uses a shared variable, if each CPU updates and fetches the value from its own cache, the value of the shared variable can be wrong. Example: a counter. With volatile, every read of the variable is fetched from main memory, and every write is pushed from L1 to L2 to main memory.
2) Instructions that use a volatile variable cannot be reordered around it.
3) A volatile variable "observes" what happened before it. For example, suppose the old values are w = 0, x = 0, f = true, and f is a volatile variable. Now CPU1 updates x = 2 and f = false. Because f is volatile, it is flushed to RAM. x is not volatile; however, f observes that x changed as well, so...
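
A minimal sketch (not from the original post) of the shared-variable problem and the volatile fix: without volatile, the worker thread may keep reading a stale cached flag and never stop. Class and field names are illustrative.

```java
public class VolatileFlagExample {
    // 'volatile' forces every read and write of this flag to go through main memory.
    static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long iterations = 0;
            while (running) {          // volatile read: always sees the latest value
                iterations++;
            }
            System.out.println("stopped after " + iterations + " iterations");
        });
        worker.start();

        Thread.sleep(100);
        running = false;               // volatile write: pushed to main memory
        worker.join();
    }
}
```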

Java Object Size and Overhead

Let's look at a few details of the object header and calculate the memory size an object occupies inside the JVM heap. Each object contains the following:
• The object header.
• The memory for primitive types.
• The memory for reference types.
• Offset / alignment: in fact, these are a few unused bytes placed after the object's data itself. This is done so that an address in memory is always a multiple of a machine word, to speed up memory reads and reduce the number of bits needed for a pointer to an object. It is also worth noting that in Java the size of any object is a multiple of 8 bytes!
• Object Header: on a 32-bit system the header size is 8 bytes; on a 64-bit system it is 16 bytes. It contains the following information.
1. Hash code
2. Garbage collection information: each Java object contains the i...
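
A small, hypothetical worked example applying the rules above on a 64-bit JVM (16-byte header, 8-byte alignment). The field sizes in the comments are assumptions; actual numbers depend on JVM flags such as compressed oops and on field reordering.

```java
// Rough size estimate for this class, following the rules from the post.
public class SizeExample {
    int id;        //  4 bytes (primitive int)
    long created;  //  8 bytes (primitive long)
    Object ref;    //  8 bytes assumed here (4 bytes with compressed oops)

    // Estimate:
    //   header     16
    //   int         4
    //   long        8
    //   reference   8
    //   subtotal   36
    //   padding     4   -> round up to the next multiple of 8
    //   total      40 bytes (approximate)
}
```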

JVM Garbage Collection (GC)

Java Heap Memory. The Young Generation is where all new objects are allocated and aged. The Tenured Generation stores old objects that started in the young generation, aged, and were moved over. The Permanent Generation stores metadata required by the JVM to describe the classes and methods used in the application.

JVM Garbage Collection. Minor garbage collection (quick, stop-the-world): all objects start in the young generation, in Eden space. When Eden is full, GC runs on Eden, selects all surviving objects and moves them to Survivor Space 0. When Eden fills again, GC runs on Eden and Survivor Space 0 and moves the survivors to Survivor Space 1. When Eden fills again, GC runs on Eden and Survivor Space 1 and moves the survivors back to Survivor Space 0, and so on. Every time an object moves from one survivor space to the other, it ages. When their age reaches 8 (on JVM 8), they are moved to the Tenured Generat...
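
A tiny sketch (not from the original post) that churns through short-lived allocations so Eden fills repeatedly and the minor collections described above can be observed; running it with a GC logging flag such as -verbose:gc prints the young-generation collections. Sizes and iteration counts are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

public class MinorGcDemo {
    public static void main(String[] args) {
        List<byte[]> survivors = new ArrayList<>();   // long-lived, will eventually be tenured

        for (int i = 0; i < 100_000; i++) {
            byte[] garbage = new byte[10_240];        // ~10 KB of short-lived garbage per iteration
            if (i % 1_000 == 0) {
                survivors.add(garbage);               // keep ~1 in 1000 alive so it ages through the survivor spaces
            }
        }
        System.out.println("kept " + survivors.size() + " survivors alive");
    }
}
```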

Java Hotspot Compiler Brief

1) Other names: JIT compiler, dynamic compiler.
2) How does a Java program run in the JVM?
- javac translates Java code to bytecode.
- The JVM interpreter runs it instruction by instruction.
- Inside the JVM, a lightweight profiler detects commonly executed blocks of instructions (for example, loops) and optimizes their execution on the fly, giving an immediate speed-up. For example, if we run 5*5*5 a thousand times, it gets faster after some runs (see the sketch below).
Why do I care? You may run a benchmarking test, the result will come out funny, and you will wonder why.
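
A rough sketch of the warm-up effect: the same loop is timed in several batches, and later batches usually run faster once HotSpot has compiled the hot method. The method name and numbers are illustrative, and this is not a rigorous benchmark (a harness like JMH would be, which is exactly the point of the post's warning).

```java
// Times the same hot method in batches; later batches are typically faster
// because HotSpot compiles the method after it becomes "hot".
public class WarmupDemo {

    static long hotMethod() {
        long sum = 0;
        for (int i = 0; i < 1_000; i++) {
            sum += 5 * 5 * 5;         // the repeated work from the post's example
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int batch = 1; batch <= 5; batch++) {
            long start = System.nanoTime();
            long result = 0;
            for (int i = 0; i < 100_000; i++) {
                result += hotMethod();
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("batch " + batch + ": " + elapsedMs + " ms (result=" + result + ")");
        }
    }
}
```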

Spark 1.6: The table metadata problem in Spark 1.6

Problem Description: In January 2018, this is what I was trying to do with Spark 1.6. There is an external Hive ORC table, partitioned on region code and country code. I read the whole table into a Spark DataFrame. I took some partitions, say regionCode=01 and countryCode=01, and changed them so that they become regionCode=02 and countryCode=02. I wrote the changed data into a temp file; "changed data" means all records that used to have regionCode=01 and countryCode=01 and now have regionCode=02 and countryCode=02. The write finished successfully and was verified after writing. Then I deleted the old partition regionCode=01 and countryCode=01. By "delete" I mean calling Hive ALTER TABLE ... DROP PARTITION(...) and deleting the files and folders of that partition in HDFS. Now I tried to read the temp file and write it back to the Hive table using dataFrame.write(). Error! It said the regionCode=01 and countryCode=01 folder was not found. Yes, because I removed it from HDFS. But I did als...
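
A rough Java sketch of the sequence of steps described above, as one might write it against the Spark 1.6 HiveContext API. Table name, database, temp path, and column values are hypothetical; the post's error shows up on the final write-back.

```java
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.hive.HiveContext;

public class PartitionRewriteSketch {
    public static void main(String[] args) {
        SparkContext sc = new SparkContext();          // configured externally via spark-submit
        HiveContext hive = new HiveContext(sc);

        // 1) Read the whole external Hive ORC table.
        DataFrame table = hive.table("my_db.my_table");

        // 2) Take the partition regionCode=01 / countryCode=01 and relabel it as 02/02.
        DataFrame changed = table
                .filter("regionCode = '01' AND countryCode = '01'")
                .withColumn("regionCode", functions.lit("02"))
                .withColumn("countryCode", functions.lit("02"));

        // 3) Write the changed records to a temporary location and verify them.
        changed.write().mode(SaveMode.Overwrite).format("orc").save("/tmp/changed_partition");

        // 4) Drop the old partition from Hive (the HDFS folder was also deleted manually).
        hive.sql("ALTER TABLE my_db.my_table DROP PARTITION (regionCode='01', countryCode='01')");

        // 5) Read the temp data back and append it to the table. In Spark 1.6 this is
        //    where the post's error appears: the write still looks for the dropped
        //    regionCode=01/countryCode=01 folder.
        DataFrame fromTemp = hive.read().format("orc").load("/tmp/changed_partition");
        fromTemp.write().mode(SaveMode.Append).insertInto("my_db.my_table");
    }
}
```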