Re: Understanding bad estimate (related to FKs?)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Oct 31, 2020, at 9:53 AM, Guillaume Lelarge <guillaume@xxxxxxxxxxxx> wrote:
> 
> Hi,
> 
> Le ven. 30 oct. 2020 à 15:57, Philip Semanchuk <philip@xxxxxxxxxxxxxxxxxxxxx> a écrit :
> 
> 
> > On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
> >> 
> >> 
> >>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby@xxxxxxxxxxxxx>
> >>> wrote:
> >>> 
> >>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
> >>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
> >>>> <philip@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>> 
> >>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows,
> >>>>>>> actual
> >>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in
> >>>>> this node.
> >>>>>>> 
> >>>>>>> The first is Postgres' estimate. The condition for this index
> >>>>>>> scan
> >>>>> contains three expressions --
> >>>>>>> 
> >>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
> >>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
> >>>>> 
> >>>> 
> >>>> Are the columns correlated?
> >>> 
> >>> I guess it shouldn't matter, since the FKs should remove all but one
> >>> of the conditions.
> >> 
> >> Yes, I had the same expectation. I thought Postgres would calculate the
> >> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
> >> of ‘mike’, but since the frequency estimate is very accurate but the
> >> planner’s estimate is not, there’s something else going on.
> >> 
> > 
> > Well, this is quite a bit more complicated, I'm afraid :-( The clauses
> > include parameters passed from the nodes above the index scan. So even
> > if we had extended stats on the table, we couldn't use them as that
> > requires (Var op Const) conditions. So this likely ends up with a
> > product of estimates for each clause, and even then we can't use any
> > particular value so we probably end up with something like 1/ndistinct
> > or something like that. So if the values actually passed to the index
> > scan are more common and/or if the columns are somehow correlated, it's
> > not surprising we end up with an overestimate.
> 
> I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in the source table, so we have: 
> 
> 25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209
> 
> Hence the estimate of 7 rows returned.
> 
> It's interesting that five_uniform’s estimated ndistinct is low by > 50% (actual = 62958). Paradoxically, if I manually set ndistinct to the correct value of 62958, the estimate gets worse (3 rows instead of 7). 
> 
> Suggestions for fixing this are of course welcome. :-)
> 
> On a related topic, are there any in depth guides to the planner that I could read? I can (and have) read the source code and it’s been informative, but something higher level than the source code would help.
> 
> 
> You may already know this, but there's a bunch of documents up there: https://wiki.postgresql.org/wiki/Using_EXPLAIN


Bien merci, yes, I've visited most of those links and learned an enormous amount from them. I've downloaded many of them for re-reading, including yours. :-) It's helpful to be reminded of them again.

EXPLAIN ANALYZE tells me what choices the planner made, but it doesn't tell me why the planner made those choices. For instance, Tomas Vondra's post enabled me to calculate how the planner arrived at its estimate of 7 rows for one node of my query. I would prefer not to reverse engineer the planner's calculation, but instead have the planner just tell me. 

If I was able to combine that information with a summary of the planner's algorithm (a lot to ask for!), then I could understand how the planner chose its plan.

The query I asked about in the original post of this thread has 13 relations in it. IIUC, that's 13! or > 6 billion possible plans. How did the planner pick one plan out of 6 billion? I'm curious, both for practical purposes (I want my query to run well) and also because it's fascinating.

Cheers
Philip







[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux