Hi,
Le ven. 30 oct. 2020 à 15:57, Philip Semanchuk <philip@xxxxxxxxxxxxxxxxxxxxx> a écrit :
> On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> wrote:
>
> On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
>>
>>
>>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby@xxxxxxxxxxxxx>
>>> wrote:
>>>
>>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
>>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
>>>> <philip@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows,
>>>>>>> actual
>>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in
>>>>> this node.
>>>>>>>
>>>>>>> The first is Postgres' estimate. The condition for this index
>>>>>>> scan
>>>>> contains three expressions --
>>>>>>>
>>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
>>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
>>>>>
>>>>
>>>> Are the columns correlated?
>>>
>>> I guess it shouldn't matter, since the FKs should remove all but one
>>> of the conditions.
>>
>> Yes, I had the same expectation. I thought Postgres would calculate the
>> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
>> of ‘mike’, but since the frequency estimate is very accurate but the
>> planner’s estimate is not, there’s something else going on.
>>
>
> Well, this is quite a bit more complicated, I'm afraid :-( The clauses
> include parameters passed from the nodes above the index scan. So even
> if we had extended stats on the table, we couldn't use them as that
> requires (Var op Const) conditions. So this likely ends up with a
> product of estimates for each clause, and even then we can't use any
> particular value so we probably end up with something like 1/ndistinct
> or something like that. So if the values actually passed to the index
> scan are more common and/or if the columns are somehow correlated, it's
> not surprising we end up with an overestimate.
I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in the source table, so we have:
25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209
Hence the estimate of 7 rows returned.
It's interesting that five_uniform’s estimated ndistinct is low by > 50% (actual = 62958). Paradoxically, if I manually set ndistinct to the correct value of 62958, the estimate gets worse (3 rows instead of 7).
Suggestions for fixing this are of course welcome. :-)
On a related topic, are there any in depth guides to the planner that I could read? I can (and have) read the source code and it’s been informative, but something higher level than the source code would help.
You may already know this, but there's a bunch of documents up there: https://wiki.postgresql.org/wiki/Using_EXPLAIN
I'm also working on a project to better document this. I'm just at the beginning, writing it all, in english (which isn't my native language), so it takes time. I already have most of it in french in various documents/formats, but it takes time to go through all of these, summarize them, and translate them. Anyway, work in progress as they say. You can have a look at it there: https://pgplanner.readthedocs.io/en/latest/index.html. Any comment/help is very welcome.
--
Guillaume.