Posts

Showing posts from March, 2018

Spark 2.2 Count Table Multiple Times Issue

Problem

In Spark 2.2, counting the same table multiple times returns the same value regardless of the table's actual state.

Description

Table A initially has 100 records.

Loop:
(1) int countBefore = Spark table A count
(2) // Remove 2 records from table A
(3) int countAfter = Spark table A count

* Note that (1) was tried with both session.sql(select count(*)) and session.sql(select *).count()

Behavior

1st iteration: countBefore = 100, countAfter = 98
2nd iteration: countBefore = 100, countAfter = 96

Explanation and Solution

It looks like the count is cached as an optimization, even though the variable countBefore is declared inside the loop body. Declaring the variable in the narrower scope of a single iteration does not prevent the cached result from being reused in the next one.

The solution is to call session.catalog.clearCache() at the beginning of every iteration. The other attempt, session.catalog.refreshTable(tableA), did not solve the problem.
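The workaround described above can be sketched in Java as follows. This is only an illustration of where the cache-clearing call sits in the loop, not the original code: the class name CountLoopSketch and the three callbacks are hypothetical stand-ins, used so that the loop compiles without a Spark dependency. The comments show how they might be wired to a real SparkSession through the Java API.

```java
import java.util.function.LongSupplier;

/**
 * Sketch of the loop from the post, with the cache cleared at the top of
 * every iteration. The callbacks are hypothetical stand-ins (assumptions,
 * not part of the original post). With a SparkSession named `session`,
 * they might be wired like this in the Java API:
 *
 *   clearCache : () -> session.catalog().clearCache()
 *   countRows  : () -> session.table("A").count()
 *   removeTwo  : whatever removes 2 records from table A in your storage layer
 */
final class CountLoopSketch {

    /** Runs the loop and returns {countBefore, countAfter} for each iteration. */
    static long[][] run(int iterations,
                        Runnable clearCache,
                        LongSupplier countRows,
                        Runnable removeTwo) {
        long[][] results = new long[iterations][2];
        for (int i = 0; i < iterations; i++) {
            // The workaround: drop cached data before the first count.
            clearCache.run();

            long countBefore = countRows.getAsLong(); // step (1)
            removeTwo.run();                          // step (2)
            long countAfter = countRows.getAsLong();  // step (3)

            results[i][0] = countBefore;
            results[i][1] = countAfter;
        }
        return results;
    }
}
```

Passing the Spark calls in as callbacks is only a convenience for this sketch; in an actual job the three calls would normally be written inline in the loop body.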