Spark 2.2 Count Table Multiple Times Issue
Problem
In Spark 2.2, counting the same table multiple times returns the same value regardless of the table's actual state.
Description
There is a table A that initially has 100 records.
Loop:
(1) long countBefore = Spark table A count
(2) //Remove 2 records from table A
(3) long countAfter = Spark table A count
* Note that (1) was tried with both session.sql("select count(*) from A") and session.sql("select * from A").count()
Behavior
1st iteration: countBefore = 100, countAfter = 98
2nd iteration: countBefore = 100 (expected 98, the table's actual count at that point), countAfter = 96
Explanation and Solution
It looks like the count is cached as an optimization. Declaring the variable countBefore inside the loop body makes no difference: variable scope does not apply here, because the stale value appears to come from Spark's cache rather than from the variable itself.
The solution is to call session.catalog.clearCache() at the beginning of every iteration.
The other attempt, session.catalog.refreshTable(tableA), did not solve the problem.
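The workaround can be sketched as a small loop helper. This is only an illustration under assumptions: the class CountLoop, its Counts holder, and the run method are hypothetical names, not part of Spark, and the Spark calls are passed in as callbacks so the shape of the loop can be seen (and tried) without a cluster.

```java
import java.util.function.LongSupplier;

// Hypothetical helper (not a Spark API): the loop from the description,
// with the cache cleared at the start of every iteration.
final class CountLoop {

    // countBefore and countAfter as observed in one iteration.
    static final class Counts {
        final long before;
        final long after;

        Counts(long before, long after) {
            this.before = before;
            this.after = after;
        }
    }

    // clearCache, countTable and removeRecords are supplied by the caller;
    // with Spark they would wrap calls on the session (an assumption here).
    static Counts[] run(int iterations,
                        Runnable clearCache,
                        LongSupplier countTable,
                        Runnable removeRecords) {
        Counts[] results = new Counts[iterations];
        for (int i = 0; i < iterations; i++) {
            clearCache.run();                          // drop cached data first
            long countBefore = countTable.getAsLong(); // step (1)
            removeRecords.run();                       // step (2)
            long countAfter = countTable.getAsLong();  // step (3)
            results[i] = new Counts(countBefore, countAfter);
        }
        return results;
    }
}
```

With a SparkSession named session, the callbacks might be wired as () -> session.catalog().clearCache() for clearCache and () -> session.table("A").count() for countTable, while removeRecords stands for whatever code deletes the two records. Note that Dataset.count() returns a long in the Java API, which is why the counts are held as long.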