Re: Understanding bad estimate (related to FKs?)

Philip Semanchuk <philip@xxxxxxxxxxxxxxxxxxxxx> · Fri, 30 Oct 2020 10:56:54 -0400

> On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> wrote:
> 
> On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
>> 
>> 
>>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby@xxxxxxxxxxxxx>
>>> wrote:
>>> 
>>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
>>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
>>>> <philip@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> 
>>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows,
>>>>>>> actual
>>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in
>>>>> this node.
>>>>>>> 
>>>>>>> The first is Postgres' estimate. The condition for this index
>>>>>>> scan
>>>>> contains three expressions --
>>>>>>> 
>>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
>>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
>>>>> 
>>>> 
>>>> Are the columns correlated?
>>> 
>>> I guess it shouldn't matter, since the FKs should remove all but one
>>> of the conditions.
>> 
>> Yes, I had the same expectation. I thought Postgres would calculate the
>> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
>> of ‘mike’, but since the frequency estimate is very accurate but the
>> planner’s estimate is not, there’s something else going on.
>> 
> 
> Well, this is quite a bit more complicated, I'm afraid :-( The clauses
> include parameters passed from the nodes above the index scan. So even
> if we had extended stats on the table, we couldn't use them as that
> requires (Var op Const) conditions. So this likely ends up with a
> product of estimates for each clause, and even then we can't use any
> particular value so we probably end up with something like 1/ndistinct
> or something like that. So if the values actually passed to the index
> scan are more common and/or if the columns are somehow correlated, it's
> not surprising we end up with an overestimate.

I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in the source table, so we have: 

25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209

Hence the estimate of 7 rows returned.

It's interesting that five_uniform’s estimated ndistinct is low by > 50% (actual = 62958). Paradoxically, if I manually set ndistinct to the correct value of 62958, the estimate gets worse (3 rows instead of 7). 

Suggestions for fixing this are of course welcome. :-)

On a related topic, are there any in depth guides to the planner that I could read? I can (and have) read the source code and it’s been informative, but something higher level than the source code would help.

Thanks so much
Philip