"David G. Johnston" <david.g.johnston@xxxxxxxxx> writes: > On Mon, Jul 24, 2017 at 3:46 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote: >> The cost to form the inner hash is basically negligible whether it's >> de-duped or not, but if it's not (known) de-duped then the cost >> estimate for the semijoin is going to rise some, and that discourages >> selecting it. > Why does the "hash semi join" care about duplication of values on the > inner relation? Doesn't it only care whether a given bucket exists > irrespective of its contents? No, it cares about the average bucket size, ie number of entries that a probe will have to look at. The non-de-duped inner relation can't be assumed to have a flat distribution among the buckets. > Pointing me to the readme or code file (comments) that explains this in > more detail would be welcome. Look for estimate_hash_bucketsize in selfuncs.c and its caller in costsize.c. The key point is this argument in that function's header comment: * We are passed the number of buckets the executor will use for the given * input relation. If the data were perfectly distributed, with the same * number of tuples going into each available bucket, then the bucketsize * fraction would be 1/nbuckets. But this happy state of affairs will occur * only if (a) there are at least nbuckets distinct data values, and (b) * we have a not-too-skewed data distribution. Otherwise the buckets will * be nonuniformly occupied. If the other relation in the join has a key * distribution similar to this one's, then the most-loaded buckets are * exactly those that will be probed most often. Therefore, the "average" * bucket size for costing purposes should really be taken as something close * to the "worst case" bucket size. We try to estimate this by adjusting the * fraction if there are too few distinct data values, and then scaling up * by the ratio of the most common value's frequency to the average frequency. * * If no statistics are available, use a default estimate of 0.1. This will * discourage use of a hash rather strongly if the inner relation is large, * which is what we want. We do not want to hash unless we know that the * inner rel is well-dispersed (or the alternatives seem much worse). That's certainly a bit pessimistic, but the thing to remember about any hashing algorithm is that occasionally it will eat your lunch. We prefer to go to sort-based algorithms if there seems to be a risk of the hash degenerating to O(N^2). regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general