On 1/15/21 9:21 AM, Alexander Stoddard wrote:
Unfortunately I'm not free to share the specific schema or the query
plans. They derive from an upstream vendor that is 'protective' of their
data model. To get to a proper example I'll need to recreate the
behavior with generic data in an anonymized schema.
I apologize for being frustratingly vague, and I feel like an idiot for
not at least saying that this is with PG 11.10.
It has been helpful that the replies so far appear to validate my
understanding that no version of Postgres gives the planner cross-table
correlation statistics.
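(For what it's worth, extended statistics can capture correlation, but
only within a single table. A sketch with invented table and column
names:

    CREATE STATISTICS enc_corr (dependencies, ndistinct)
        ON region_id, site_id FROM encounters;
    ANALYZE encounters;

As far as I can tell, nothing equivalent exists across a join.)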
ANALYZE is run frequently, and I think with a sufficient sample size.
The context is a large data-analysis setting where the data changes via
bulk ETL, not OLTP. The effect of analyzing is simply instability: the
plan can flip in either direction (without the underlying data changing)
between giving an answer in minutes and timing out after tens of hours.
That could indicate too small a sample, but I think in this case the
statistics just aren't useful, because the problem is cross-table
correlation that they can't account for.
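If sample size really were the problem, the first thing I'd try is
raising the per-column statistics target before re-analyzing, along the
lines of (names invented):

    ALTER TABLE encounters ALTER COLUMN site_id SET STATISTICS 1000;  -- default is 100
    ANALYZE encounters;

but I don't expect that to help with a cross-table effect.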
So to be clear: the process imports the data, you run a query and it
completes in time x, you then ANALYZE the same data and the query runs
in time y. Is that correct?
The 'fast' plans use parallel seq scans; the 'slow' plans use index
scans. It appears a good query plan correctly predicts that it should be
bulk-processing the tables, while a bad one is fooled into thrashing
(hard disk, not SSD) by predicting too few rows to join between the
tables.
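The workarounds I can think of (untested guesses on my part) are to make
random I/O look more expensive to the planner on this spinning-disk
host, or to disable index scans for the problem session entirely:

    SET random_page_cost = 8;      -- default 4.0; higher discourages index scans
    -- or, more heavy-handed, per session:
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;

I'd welcome opinions on whether that is a sane direction.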
If anyone has had similar experiences and is generous enough with their
time to share possible solutions or workarounds, I'm most grateful. If
my description is too vague to be worth answering, I quite understand
and apologize for the time spent reading it.
Thank you.
On Fri, Jan 15, 2021 at 10:26 AM David G. Johnston
<david.g.johnston@xxxxxxxxx> wrote:
On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer <shammat@xxxxxxx> wrote:
Atul Kumar wrote on 15.01.2021 at 16:29:
> As per Ron, you are not supposed to ask your questions here.
>
> According to him, we should keep on doing research on the internet
> rather than asking for support directly, even when you have done
> enough research, and unless “Ron” is satisfied you have to keep on
> researching.
Ron's question was perfectly valid.
Missing and wrong statistics are one reason for the planner to
choose a bad execution plan.
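The quickest way to check is to compare estimated and actual row counts
for the problem query, e.g.:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT ...;  -- the problematic join here

and look for plan nodes where the estimated "rows=" diverges wildly from
the actual row count.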
Yeah, at first blush I didn't think analyze really mattered (and it
mostly doesn't: while you can keep the statistics up-to-date, the
multi-table nature of the problem means they are only marginally helpful
here), but that just points out how under-specified the original posting
is. Taken as the simple question "is there a way to work around the lack
of multi-table statistics?", the analyze step, and even the specific
queries, don't matter all that much. It would be much more useful,
though, if the OP would choose a single problematic query and show the
schema, the query, and the explain results, ideally both good and bad,
and comment on how analyze seems to affect the plan choice. For the
general question about overcoming our statistics limitations, the
analyze point is not relevant.
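One stopgap that might apply here: on 11.x a CTE is still an
optimization fence, so materializing the bulk side first can force the
sequential-scan shape. A rough sketch with an invented schema:

    WITH bulk AS (
        SELECT fact_id, dim_id
        FROM big_fact
        WHERE load_date = DATE '2021-01-14'
    )
    SELECT d.name, count(*)
    FROM bulk
    JOIN dimension d USING (dim_id)
    GROUP BY d.name;

(On 12+ the same effect needs WITH bulk AS MATERIALIZED (...).)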
David J.
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx