Posts

Showing posts from January, 2018

JVM Garbage Collection (GC)

Java Heap Memory
- The Young Generation is where all new objects are allocated and aged.
- The Tenured Generation stores old objects, which start in the young generation space, age there, and are then moved over.
- The Permanent Generation stores the metadata the JVM needs to describe the classes and methods used in the application (from Java 8 on, this metadata lives in Metaspace instead).

JVM Garbage Collection
Minor garbage collection (quick, stop-the-world)
1) All objects start in the young generation, in Eden space.
2) When Eden space is full, GC runs on Eden space, selects all surviving objects and moves them to Survivor Space 0.
3) When Eden space is full again, GC runs on Eden space and Survivor Space 0, selects the survivors and moves them to Survivor Space 1.
4) When Eden space is full again, GC runs on Eden space and Survivor Space 1, selects the survivors and moves them to Survivor Space 0. And so on.
5) Every time an object moves from one survivor space to the other, it ages. When its age reaches 8 (JVM 8), it is moved to the Tenured Generation space.
G
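The aging steps described above can be played through with a toy model. This is only a sketch under simplifying assumptions: every allocated object stays alive, objects enter a survivor space with age 0 and age by one on each survivor-to-survivor move, and the class and method names (MinorGcSketch, allocateLive, minorGc) are made up for illustration. It is not how HotSpot is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of minor GC aging. All names are hypothetical, not JVM APIs.
class MinorGcSketch {
    // Ages of the live objects sitting in the currently used survivor space.
    private List<Integer> survivor = new ArrayList<>();
    private int edenLive = 0;     // live objects currently in Eden
    private int tenured = 0;      // objects promoted to the old generation
    private final int threshold;  // age at which an object is promoted

    MinorGcSketch(int threshold) {
        this.threshold = threshold;
    }

    // Assumption: every object allocated here is still reachable at GC time.
    void allocateLive(int count) {
        edenLive += count;
    }

    // One minor collection: survivors are copied to the other survivor space.
    void minorGc() {
        List<Integer> to = new ArrayList<>();
        for (int age : survivor) {
            int newAge = age + 1;          // survivor-to-survivor move ages it
            if (newAge >= threshold) {
                tenured++;                 // promoted to the tenured space
            } else {
                to.add(newAge);
            }
        }
        for (int k = 0; k < edenLive; k++) {
            to.add(0);                     // first move out of Eden, age 0
        }
        edenLive = 0;
        survivor = to;                     // the two survivor spaces swap roles
    }

    int tenuredCount() { return tenured; }

    int survivorCount() { return survivor.size(); }

    public static void main(String[] args) {
        MinorGcSketch heap = new MinorGcSketch(8);
        heap.allocateLive(1);
        for (int gc = 1; gc <= 9; gc++) {
            heap.minorGc();
            System.out.println("after GC " + gc + ": survivors="
                    + heap.survivorCount() + " tenured=" + heap.tenuredCount());
        }
    }
}
```

With a threshold of 8, a single long-lived object in this model sits in the survivor spaces for eight collections and is promoted on the ninth.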

Java Hotspot Compiler Brief

1) Other names: JIT Compiler, Dynamic Compiler
2) How does a Java program run in the JVM?
- javac translates Java code to byte code.
- The JVM interpreter runs the byte code instruction by instruction.
- Inside the JVM, a lightweight profiler detects frequently executed blocks of instructions (for example, a loop) and tries to optimize their execution on the fly, giving an immediate speed-up. For example, if we run 5*5*5 1000 times, it gets faster after some runs.
Why do I care? You may run a benchmarking test. The results will come out funny and you will wonder why.
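The "funny" benchmark results can be seen with a small timing loop. The sketch below times batches of 1000 calls of 5*5*5 and prints the elapsed time per batch; the names WarmupSketch, cube and timeBatch are made up for this illustration. Timings depend on the machine and JVM, and the compiler may fold such a trivial computation away, so treat the numbers as an indication only.

```java
// Hypothetical micro-benchmark sketch; not a rigorous benchmark harness.
class WarmupSketch {
    static int cube(int x) {
        return x * x * x;
    }

    // Runs cube(5) 'calls' times and returns the elapsed time in nanoseconds.
    static long timeBatch(int calls) {
        long start = System.nanoTime();
        int sink = 0;
        for (int i = 0; i < calls; i++) {
            sink += cube(5);   // accumulate so the result is actually used
        }
        long elapsed = System.nanoTime() - start;
        if (sink != 125 * calls) {
            throw new IllegalStateException("unexpected result: " + sink);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Early rounds usually run interpreted; later rounds may be compiled.
        for (int round = 1; round <= 10; round++) {
            System.out.println("round " + round + ": " + timeBatch(1000) + " ns");
        }
    }
}
```

If the first rounds are noticeably slower than the later ones, that difference is the warm-up effect, which is why a benchmark should discard its first iterations.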

Spark 1.6: The table metadata problem in Spark 1.6

Problem Description
In Jan 2018, this is what I was trying to do with Spark 1.6:
1) There is an external Hive ORC table, partitioned on region code and country code. I read the whole table into a Spark DataFrame.
2) I took some partitions, let's say regionCode=01 and countryCode=01, and changed them to regionCode=02 and countryCode=02.
3) I wrote the change into a temp file. The change means all records which used to have regionCode=01 and countryCode=01 and now have regionCode=02 and countryCode=02. The write finished successfully and I verified it after writing.
4) Then I deleted the old partition regionCode=01 and countryCode=01. By "delete", I mean I called Hive ALTER TABLE DROP PARTITION(...) and deleted the files and folders of that partition in HDFS.
5) Now I tried to read the temp file and write it back to the Hive table using dataFrame.write().
Error! It said the regionCode=01 and countryCode=01 folder was not found. Yes, because I removed it from HDFS. But I did also remove

Java Synchronization: The Reader

Which methods need synchronized?
Consider the following class stub. At this point there is no doubt that add() and remove() need "synchronized". How about get()? get() just does a read. Does it need a lock?

A method like get() needs synchronized too:
1) To avoid inconsistent data. get() should not look inside the structure while another thread is in the middle of a multi-step change, because what it sees could be inconsistent.
2) To get the latest data from memory, not from a local cache. The following image describes the cache system of a multi-core processor. As you can see, each core has its own cache. The only memory shared between cores is RAM.

Let's say the original value is 1|2|3. Processor 0 Core 0 is running thread 1, which reads. Processor 1 Core 1 is running thread 2, which writes. Thread 2 changes the value to 1|2|3|4. Since add() is synchronized, the new value got added in P1C1 cache level 1, push to cache level 2,
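The class stub from the original post is not shown in this excerpt, so the following is only a guess at its shape: a hypothetical SharedList with add(), remove() and get(), all synchronized on the same object. The demo method mirrors the 1|2|3 to 1|2|3|4 example, with a writer thread calling add(4) and the reader calling get() afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the class stub discussed in the post.
class SharedList {
    private final List<Integer> items = new ArrayList<>();

    public synchronized void add(int value) {
        items.add(value);
    }

    public synchronized boolean remove(int value) {
        return items.remove(Integer.valueOf(value));
    }

    // Synchronized as well: the reader takes the same lock as the writers,
    // so it never sees a half-finished change and always sees the latest one.
    public synchronized String get() {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < items.size(); k++) {
            if (k > 0) {
                sb.append('|');
            }
            sb.append(items.get(k));
        }
        return sb.toString();
    }

    // Writer thread adds 4; the calling thread then reads the value.
    static String demo() {
        SharedList list = new SharedList();
        list.add(1);
        list.add(2);
        list.add(3);
        Thread writer = new Thread(() -> list.add(4));
        writer.start();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return list.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because add() and get() lock the same object, a get() that runs after add(4) has released the lock is guaranteed to see the fourth element.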

Java Synchronization: The Lock

Where is the lock?
Consider the following use case: I have a class A which has a static variable i, and a method m1() which increases the value of i by one on each of its 100 loop iterations. In a single-threaded system, it is simple. In a multi-threaded system, the value of i sometimes does not get set properly. Run the following code from 2 threads a few times: the last value of i is inconsistent, and some runs do not even end with the last value at 200.

public class A {
    static int i;

    public void m1() {
        for (int j = 0; j < 100; j++) {
            i++;
            System.out.println(i);
        }
    }
}

Why? Because i++ is not an atomic instruction. If we translate it to byte code, it will be something like:
1 - Read var = i
2 - Increase var by 1
3 - i = var
Let's say the current value of i = 7 and thread 1 is trying to make it 8. The first step is done: var = 7. Increase var by 1, so var = 8. Before thread 1 gets a chance to write back, thread 2 comes in and reads i = 7. Thread 2 increases it by 1 to
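One way to make the read, increase and write-back steps happen as a single unit is to guard i++ with a lock that both threads share. The sketch below is an assumption about where the post is heading, not the author's code: CounterSketch, LOCK and runTwoThreads are hypothetical names, and the lock is a single static object because i itself is static.

```java
// Hypothetical variant of class A with the increment guarded by one lock.
class CounterSketch {
    static int i;

    // One lock object shared by every thread, matching the static variable.
    private static final Object LOCK = new Object();

    static void m1() {
        for (int j = 0; j < 100; j++) {
            synchronized (LOCK) {
                i++;   // read, increase and write back while holding the lock
            }
        }
    }

    // Runs m1() on two threads and returns the final value of i.
    static int runTwoThreads() {
        i = 0;
        Thread t1 = new Thread(CounterSketch::m1);
        Thread t2 = new Thread(CounterSketch::m1);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return i;   // safe to read: join() orders it after both threads
    }

    public static void main(String[] args) {
        System.out.println("final i = " + runTwoThreads());
    }
}
```

With the lock in place, two threads running m1() always finish with i at 200, because no thread can read i between another thread's read and write.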

Java Synchronization: Keyword “synchronized”

What does "synchronized" mean?
1) Mutual exclusion. Only one thread at a time can enter a synchronized block.
2) Guaranteed no reordering across the block boundary.
- Instructions inside a synchronized block can be reordered among themselves. E.g. 4, 6, 5.
- Instructions outside a synchronized block can be reordered among themselves. E.g. 2, 3, 1.
- Instructions inside a synchronized block cannot be reordered to mix with those outside the block (1, 5, 2, 3).
- Instructions outside a synchronized block cannot be reordered to mix with those inside the block (7, 5, 8, 9).
3) Guaranteed to get values from memory every time, not from cache.
4) Volatile variables: see the next page (needs validation).
After all instructions in a synchronized block have been executed, all changed values are pushed from the thread's local cache into memory. Also, as we know, data movement does not work on a single variable, so the whole cache line which contains the changed value gets pushed to memory too.

Cost of the lock in "synchronized"
1) Memory Synchronization 2) Mutual Exclusi
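The mutual exclusion property can be checked with a small experiment. In the sketch below, several threads repeatedly enter a synchronized block and record how many threads are inside it at the same moment. The names MutualExclusionSketch and observeMaxInside are made up for this illustration, and the atomic counters are only there to do the measuring.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical experiment: how many threads are ever inside the block at once?
class MutualExclusionSketch {
    private static final Object LOCK = new Object();

    static int observeMaxInside(int threadCount, int iterations) {
        AtomicInteger inside = new AtomicInteger();     // threads in the block now
        AtomicInteger maxInside = new AtomicInteger();  // highest value ever seen
        Runnable task = () -> {
            for (int k = 0; k < iterations; k++) {
                synchronized (LOCK) {
                    int now = inside.incrementAndGet();
                    maxInside.accumulateAndGet(now, Math::max);
                    inside.decrementAndGet();
                }
            }
        };
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(task);
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return maxInside.get();
    }

    public static void main(String[] args) {
        System.out.println("max threads inside: " + observeMaxInside(4, 1000));
    }
}
```

Since every thread locks the same object, the recorded maximum is 1. Removing the synchronized block would allow the count to go above 1 on some runs.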

2018 Concurrency Training Note

2018 Concurrency Training Note
- Hotspot Compiler Brief
- Java Synchronization: Keyword Synchronize
- Java Synchronization: The Lock
- Java Synchronization: The Reader
- Volatile