On 1/15/21 9:21 AM, Alexander Stoddard wrote:
Unfortunately I'm not free to share the specific schema or the query
plans. They derive from an upstream vendor that is 'protective' of their
data model. To get to a proper example I'll need to recreate the
behavior with generic data in an anonymized schema.
I apologize for being frustratingly vague, and I feel like an idiot for
not at least saying that this is with PG 11.10.
It has been helpful that the replies so far appear to validate my
understanding that no version of Postgres gives the planner cross-table
correlation statistics.
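(For what it's worth, extended statistics can capture correlation, but
only within a single table. A sketch with invented table and column
names:

    CREATE STATISTICS enc_corr (dependencies, ndistinct)
        ON region_id, site_id FROM encounters;
    ANALYZE encounters;

As far as I can tell, nothing equivalent exists across a join.)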
ANALYZE is run frequently, and I think with a sufficient sample size.
The context is a large data-analysis setting where the data changes via
bulk ETL, not OLTP. The effect of analyzing is simply instability: the
plan can flip in either direction (without the underlying data changing)
between giving an answer in minutes and timing out after tens of hours.
That could indicate too small a sample, but I think in this case the
statistics just aren't useful, because the problem is cross-table
correlation that they can't account for.
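If sample size really were the problem, the first thing I'd try is
raising the per-column statistics target before re-analyzing, along the
lines of (names invented):

    ALTER TABLE encounters ALTER COLUMN site_id SET STATISTICS 1000;  -- default is 100
    ANALYZE encounters;

but I don't expect that to help with a cross-table effect.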
So to be clear: the process imports the data, you run a query and it
completes in time x, you then ANALYZE the same data and the query runs
in time y. Is that correct?
The 'fast' plans use parallel seq scans; the 'slow' plans use index
scans. It appears a good query plan correctly predicts that it should be
bulk-processing the tables, while a bad one is fooled into thrashing
(hard disk, not SSD) by predicting too few rows to join between the
tables.
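The workarounds I can think of (untested guesses on my part) are to make
random I/O look more expensive to the planner on this spinning-disk
host, or to disable index scans for the problem session entirely:

    SET random_page_cost = 8;      -- default 4.0; higher discourages index scans
    -- or, more heavy-handed, per session:
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;

I'd welcome opinions on whether that is a sane direction.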
If anyone has had similar experiences and is generous enough with their
time to share possible solutions or workarounds, I'm most grateful. If
my description is too vague to be worth answering, I quite understand
and apologize for the time spent reading it.
Thank you.
On Fri, Jan 15, 2021 at 10:26 AM David G. Johnston
<david.g.johnston@xxxxxxxxx> wrote:
On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer <shammat@xxxxxxx> wrote:
Atul Kumar wrote on 15.01.2021 at 16:29:
> As per Ron, you are not supposed to ask your questions here.
>
> According to him, we should keep on doing research on the internet
> rather than asking for support directly, even when you have done
> enough research, and unless “Ron” is satisfied you have to keep on
> researching.
Ron's question was perfectly valid.
Missing and wrong statistics are one reason for the planner to
choose a bad execution plan.
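The quickest way to check is to compare estimated and actual row counts
for the problem query, e.g.:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT ...;  -- the problematic join here

and look for plan nodes where the estimated "rows=" diverges wildly from
the actual row count.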
Yeah, at first blush I didn't think analyze really mattered (and it
mostly doesn't: while you can keep the statistics up-to-date, the
multi-table nature of the problem means they are only marginally helpful
here), but that just points out how under-specified the original posting
is. Taken as the simple question "is there a way to work around the lack
of multi-table statistics?", the analyze step, and even the specific
queries, don't matter all that much. It would be much more useful,
though, if the OP would choose a single problematic query and show the
schema, the query, and the explain results, ideally both good and bad,
and comment on how analyze seems to affect the plan choice. For the
general question about overcoming our statistics limitations, the
analyze point is not relevant.
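One stopgap that might apply here: on 11.x a CTE is still an
optimization fence, so materializing the bulk side first can force the
sequential-scan shape. A rough sketch with an invented schema:

    WITH bulk AS (
        SELECT fact_id, dim_id
        FROM big_fact
        WHERE load_date = DATE '2021-01-14'
    )
    SELECT d.name, count(*)
    FROM bulk
    JOIN dimension d USING (dim_id)
    GROUP BY d.name;

(On 12+ the same effect needs WITH bulk AS MATERIALIZED (...).)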
David J.
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx