Spark 2.2: Repeated Table Counts Return a Stale Value

Problem
In Spark 2.2, counting the same table multiple times returns the same value regardless of the table's actual state.

Description
Table A initially has 100 records.

Loop:
(1) int countBefore = Spark table A count
(2) //Remove 2 records from table A
(3) int countAfter = Spark table A count

* Note that step (1) was tried with both session.sql("SELECT COUNT(*) FROM A") and session.sql("SELECT * FROM A").count()

Behavior
1st iteration: countBefore = 100, countAfter = 98
2nd iteration: countBefore = 100 (expected 98, since 2 records were removed in the 1st iteration), countAfter = 96

Explanation and Solution
It looks like the count is cached as an optimization, even though the variable countBefore is declared inside the loop iteration. Variable scope has no effect on this caching.
The solution is to call session.catalog.clearCache() at the beginning of every iteration.
The other attempt, calling session.catalog.refreshTable(tableA), did not solve the problem.
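
The loop with the workaround applied can be sketched roughly as follows. This is a minimal sketch, not the original code: the class name CountLoopSketch, the method runIteration, and the in-memory stand-in in main are hypothetical and exist only so the example runs without a Spark cluster. The comments show how the three callbacks would presumably be wired to a real SparkSession in Java, assuming the table is registered under the name A.

```java
import java.util.function.LongSupplier;

final class CountLoopSketch {

    /**
     * Runs one loop iteration: clear the cache first, then count,
     * remove records, and count again.
     * Returns {countBefore, countAfter}.
     *
     * Assumed wiring with a real SparkSession named session:
     *   clearCache -> () -> session.catalog().clearCache()
     *   countRows  -> () -> session.sql("SELECT * FROM A").count()
     *   removeTwo  -> whatever removes 2 records from table A
     */
    static long[] runIteration(Runnable clearCache,
                               LongSupplier countRows,
                               Runnable removeTwo) {
        clearCache.run();                        // workaround: start of every iteration
        long countBefore = countRows.getAsLong(); // step (1)
        removeTwo.run();                          // step (2)
        long countAfter = countRows.getAsLong();  // step (3)
        return new long[] {countBefore, countAfter};
    }

    public static void main(String[] args) {
        // Hypothetical in-memory stand-in for table A (100 records).
        // It does not reproduce Spark's caching; it only exercises the loop order.
        long[] rows = {100};
        int[] cacheClears = {0};

        for (int i = 1; i <= 2; i++) {
            long[] counts = runIteration(
                    () -> cacheClears[0]++,  // stands in for clearCache()
                    () -> rows[0],           // stands in for the Spark count
                    () -> rows[0] -= 2);     // removes 2 records
            System.out.println("iteration " + i
                    + ": countBefore = " + counts[0]
                    + ", countAfter = " + counts[1]);
        }
    }
}
```

With the stand-in, the second iteration reports countBefore = 98, which is the value the Spark loop should also report once the cache is cleared at the start of each iteration.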
