Re: Perfomance of IN-clause with many elements and possible solutions

"David G. Johnston" <david.g.johnston@xxxxxxxxx> · Mon, 24 Jul 2017 22:09:07 -0700

On Mon, Jul 24, 2017 at 8:11 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
[*docs] If the data were perfectly distributed, with the same

 * number of tuples going into each available bucket, then the bucketsize

 * fraction would be 1/nbuckets.  But this happy state of affairs will occur

 * only if (a) there are at least nbuckets distinct data values, and (b)

 * we have a not-too-skewed data distribution.  Otherwise the buckets will

 * be nonuniformly occupied.

Thanks, I have a better feel now.  Using this example (200 inner relation rows) is pretty poor since at this scale there doesn't seem to be enough data to make a noticeable difference.

But anyway, the above comment is only being applied when dealing with a non-unique inner relation; however, the fraction used is 1/nbuckets for any unique relation regardless of its size.

if (IsA(inner_path, UniquePath))
    innerbucketsize = 1.0 / virtualbuckets;
else

And to clarify for others only reading this...the 200 on the "VALUES" node is there because there are 200 literal values in the value_list.  The 200 on the resulting Hash (and HashAggregate in the example) node is there because of DEFAULT_NUM_DISTINCT (changing the query limit to 300 only changed the former).  Further, since it is only the default, the fraction used charged out is 1/10 instead of 1/200 that would used if the 200 were a real number instead - or 1/1024 if those 200 rows were known to be themselves unique.

For me, I'm seeing that the expected number of input rows doesn't factor into the innerbucketsize computation directly (possibly excepting a scaling factor adjustment).

I can understand better, now, why this seemingly perfect example of a semi-join query gets executed with an extra distinct/grouping node.

David J.