
Re: Performance of IN-clause with many elements and possible solutions

"David G. Johnston" <david.g.johnston@xxxxxxxxx> writes:
> On Mon, Jul 24, 2017 at 3:46 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> The cost to form the inner hash is basically negligible whether it's
>> de-duped or not, but if it's not (known) de-duped then the cost
>> estimate for the semijoin is going to rise some, and that discourages
>> selecting it.

> ​Why does the "hash semi join" care about duplication of values on the
> inner relation?  Doesn't it only care whether a given bucket exists
> irrespective of its contents?

No, it cares about the average bucket size, i.e., the number of entries
that a probe will have to look at.  The non-de-duped inner relation can't
be assumed to have a flat distribution among the buckets.
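As a toy illustration (a standalone C sketch, not PostgreSQL code; the
constants and names are invented), hash a duplicate-heavy inner input into a
fixed number of buckets and ask how many entries an average probe has to
examine when the outer side follows the same key distribution:

#include <stdio.h>

#define NBUCKETS 16
#define NROWS    1024

int main(void)
{
    int     keys[NROWS];
    int     load[NBUCKETS] = {0};
    long    weighted = 0;
    int     i, b;

    /* Skewed inner input: half the rows share key 42, the rest are distinct. */
    for (i = 0; i < NROWS; i++)
        keys[i] = (i % 2 == 0) ? 42 : i;

    /* "Hash" each inner row into a bucket (key mod NBUCKETS). */
    for (i = 0; i < NROWS; i++)
        load[keys[i] % NBUCKETS]++;

    /*
     * If outer probes follow the same key distribution as the inner rows,
     * the chance of hitting a bucket is proportional to its load, so the
     * expected number of entries examined per probe is sum(load^2) / NROWS.
     */
    for (b = 0; b < NBUCKETS; b++)
        weighted += (long) load[b] * load[b];

    printf("ideal entries per probe (1/nbuckets): %.1f\n",
           (double) NROWS / NBUCKETS);
    printf("expected entries per probe (skewed) : %.1f\n",
           (double) weighted / NROWS);
    return 0;
}

With half the rows sharing one key, the expected per-probe cost comes out
several times larger than the uniform 1/nbuckets figure, which is exactly
the effect the cost estimate has to account for when the inner side isn't
known to be de-duped.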

> Pointing me to the readme or code file (comments) that explains this in
> more detail would be welcome.

Look for estimate_hash_bucketsize in selfuncs.c and its caller in
costsize.c.  The key point is this argument in that function's
header comment:

 * We are passed the number of buckets the executor will use for the given
 * input relation.  If the data were perfectly distributed, with the same
 * number of tuples going into each available bucket, then the bucketsize
 * fraction would be 1/nbuckets.  But this happy state of affairs will occur
 * only if (a) there are at least nbuckets distinct data values, and (b)
 * we have a not-too-skewed data distribution.  Otherwise the buckets will
 * be nonuniformly occupied.  If the other relation in the join has a key
 * distribution similar to this one's, then the most-loaded buckets are
 * exactly those that will be probed most often.  Therefore, the "average"
 * bucket size for costing purposes should really be taken as something close
 * to the "worst case" bucket size.  We try to estimate this by adjusting the
 * fraction if there are too few distinct data values, and then scaling up
 * by the ratio of the most common value's frequency to the average frequency.
 *
 * If no statistics are available, use a default estimate of 0.1.  This will
 * discourage use of a hash rather strongly if the inner relation is large,
 * which is what we want.  We do not want to hash unless we know that the
 * inner rel is well-dispersed (or the alternatives seem much worse).
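Paraphrased as arithmetic (a simplified sketch of what that comment
describes, not the actual selfuncs.c code; the function name and numbers
are made up for illustration):

#include <stdio.h>

/*
 * Estimate the fraction of the inner rows an average probe will examine,
 * following the reasoning in the comment quoted above.
 */
static double
sketch_bucketsize_fraction(double ndistinct, double nbuckets, double mcv_freq)
{
    double  estfract;
    double  avgfreq;

    if (ndistinct <= 0)
        return 0.1;             /* no statistics: be strongly pessimistic */

    /* Perfectly dispersed data would give 1/nbuckets ... */
    estfract = 1.0 / nbuckets;

    /* ... but with fewer distinct values than buckets, at best 1/ndistinct. */
    if (ndistinct < nbuckets)
        estfract = 1.0 / ndistinct;

    /* Scale up by the most common value's frequency over the average. */
    avgfreq = 1.0 / ndistinct;
    if (mcv_freq > avgfreq)
        estfract *= mcv_freq / avgfreq;

    return estfract;
}

int main(void)
{
    /* 1024 buckets, only 500 distinct inner values, MCV covers 10% of rows. */
    printf("estimated bucketsize fraction: %.4f\n",
           sketch_bucketsize_fraction(500, 1024, 0.10));
    return 0;
}

Roughly speaking, the join cost then charges each outer probe in proportion
to that fraction of the inner rows, which is how a skewed or duplicate-heavy
inner side discourages the hash plan.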

That's certainly a bit pessimistic, but the thing to remember about any
hashing algorithm is that occasionally it will eat your lunch.  We prefer
to go to sort-based algorithms if there seems to be a risk of the hash
degenerating to O(N^2).
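To put rough numbers on that worst case (purely hypothetical figures): if a
million inner rows all share one key, every outer probe walks that entire
million-entry chain, so a million probes cost on the order of 10^12
comparisons, which is the O(N^2) behavior referred to above.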

			regards, tom lane




