> On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> wrote: > > On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote: >> >> >>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby@xxxxxxxxxxxxx> >>> wrote: >>> >>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote: >>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk >>>> <philip@xxxxxxxxxxxxxxxxxxxxx> wrote: >>>> >>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows, >>>>>>> actual >>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in >>>>> this node. >>>>>>> >>>>>>> The first is Postgres' estimate. The condition for this index >>>>>>> scan >>>>> contains three expressions -- >>>>>>> >>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike = >>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text) >>>>> >>>> >>>> Are the columns correlated? >>> >>> I guess it shouldn't matter, since the FKs should remove all but one >>> of the conditions. >> >> Yes, I had the same expectation. I thought Postgres would calculate the >> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency >> of ‘mike’, but since the frequency estimate is very accurate but the >> planner’s estimate is not, there’s something else going on. >> > > Well, this is quite a bit more complicated, I'm afraid :-( The clauses > include parameters passed from the nodes above the index scan. So even > if we had extended stats on the table, we couldn't use them as that > requires (Var op Const) conditions. So this likely ends up with a > product of estimates for each clause, and even then we can't use any > particular value so we probably end up with something like 1/ndistinct > or something like that. So if the values actually passed to the index > scan are more common and/or if the columns are somehow correlated, it's > not surprising we end up with an overestimate. I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in the source table, so we have: 25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209 Hence the estimate of 7 rows returned. It's interesting that five_uniform’s estimated ndistinct is low by > 50% (actual = 62958). Paradoxically, if I manually set ndistinct to the correct value of 62958, the estimate gets worse (3 rows instead of 7). Suggestions for fixing this are of course welcome. :-) On a related topic, are there any in depth guides to the planner that I could read? I can (and have) read the source code and it’s been informative, but something higher level than the source code would help. Thanks so much Philip