"David G. Johnston" <david.g.johnston@xxxxxxxxx> writes: > On Mon, Jul 24, 2017 at 3:46 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote: >> The cost to form the inner hash is basically negligible whether it's >> de-duped or not, but if it's not (known) de-duped then the cost >> estimate for the semijoin is going to rise some, and that discourages >> selecting it. > Why does the "hash semi join" care about duplication of values on the > inner relation? Doesn't it only care whether a given bucket exists > irrespective of its contents? No, it cares about the average bucket size, ie number of entries that a probe will have to look at. The non-de-duped inner relation can't be assumed to have a flat distribution among the buckets. > Pointing me to the readme or code file (comments) that explains this in > more detail would be welcome. Look for estimate_hash_bucketsize in selfuncs.c and its caller in costsize.c. The key point is this argument in that function's header comment: * We are passed the number of buckets the executor will use for the given * input relation. If the data were perfectly distributed, with the same * number of tuples going into each available bucket, then the bucketsize * fraction would be 1/nbuckets. But this happy state of affairs will occur * only if (a) there are at least nbuckets distinct data values, and (b) * we have a not-too-skewed data distribution. Otherwise the buckets will * be nonuniformly occupied. If the other relation in the join has a key * distribution similar to this one's, then the most-loaded buckets are * exactly those that will be probed most often. Therefore, the "average" * bucket size for costing purposes should really be taken as something close * to the "worst case" bucket size. We try to estimate this by adjusting the * fraction if there are too few distinct data values, and then scaling up * by the ratio of the most common value's frequency to the average frequency. * * If no statistics are available, use a default estimate of 0.1. This will * discourage use of a hash rather strongly if the inner relation is large, * which is what we want. We do not want to hash unless we know that the * inner rel is well-dispersed (or the alternatives seem much worse). That's certainly a bit pessimistic, but the thing to remember about any hashing algorithm is that occasionally it will eat your lunch. We prefer to go to sort-based algorithms if there seems to be a risk of the hash degenerating to O(N^2). regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general