On Mon, Apr 22, 2019 at 10:07:52AM +0200, Gaetano Mendola wrote:
> Batch splitting shouldn't be followed by a hash function change?
What would be the value? That can help with hash collisions, but that's
not the issue with the data sets discussed in this thread. The issue
reported originally is about underestimates, and the sample data set has
a large number of duplicate values (a single value representing ~10% of
the data set). Neither of those issues is about hash collisions.
The data set I used to demonstrate how the algorithms work is pretty
much ideal, with a uniform distribution and no hash collisions.
Furthermore, I don't think we can just change the hash function, for a
couple of technical reasons.
Firstly, it's not like we totally redistribute the whole data set from N
old batches to (2*N) new ones. By using the same 32-bit hash value and
considering one extra bit, the tuples either stay in the same batch
(when the new bit is 0) or move to a single new batch (when it's 1). So
each batch is simply split in two. If we changed the hash function, that
would no longer be true, and we'd have to redistribute pretty much the
whole data set.
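To make that concrete, here's a minimal sketch of the "one extra bit"
behavior. It's not the actual executor code - the helper name and the
exact bit layout are illustrative, and it assumes nbatch is a power of
two - but the property it checks is the one that matters here:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Illustrative only -- not the real batch-number computation.
     * Pick the batch from the low bits of the 32-bit hash, assuming
     * nbatch is a power of two.
     */
    static uint32_t
    batch_for(uint32_t hashvalue, uint32_t nbatch)
    {
        return hashvalue & (nbatch - 1);
    }

    int
    main(void)
    {
        uint32_t    old_nbatch = 4;
        uint32_t    new_nbatch = 8;     /* nbatch doubled */

        for (uint32_t h = 0; h < 1000; h++)
        {
            uint32_t    oldb = batch_for(h, old_nbatch);
            uint32_t    newb = batch_for(h, new_nbatch);

            /*
             * Doubling nbatch consults exactly one more bit of the
             * same hash value, so a tuple from old batch K either
             * stays in K (new bit is 0) or moves to the single batch
             * K + old_nbatch (new bit is 1) -- never anywhere else,
             * and never backwards.
             */
            assert(newb == oldb || newb == oldb + old_nbatch);
        }
        printf("every tuple stays in K or moves to K + old_nbatch\n");
        return 0;
    }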
The other issue is even more significant - we don't redistribute the
tuples immediately. We only redistribute the current batch, and leave
the other batches alone, handling them when we actually get to them.
This is possible because the tuples never move backwards - when going
from N to (2*N) batches, the tuples from batch K either stay in K or
move to K + N. Or something like that, I don't recall the exact formula
offhand.
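As an illustrative example: say we go from 4 to 8 batches while
processing batch 2. Under the scheme above, its tuples land either in
batch 2 or in batch 6; batches 0 and 1, which may already have been
processed, are never candidates, and batches like 3 can simply be split
later, when we actually get to them.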
And if I recall correctly, we can increase the number of batches while
the join is already in progress, after some rows have already been
processed. That would probably no longer work if we just switched the
hash function, because it might move rows backwards (into the
already-processed region).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services