Two years later, I still remember this. And today I confirmed someone
hitting this on open source PG13.

In the incident I saw today, the tuple reads during planning were leading
to excessive multixact (mxid) waits - I suspect memory pressure caused the
mxid offsets/members data to get evicted from cache, so lookups started
hitting disk. It was the mxid waits that actually made the system go
sideways - I was just surprised that we saw most of the system CPU in
planning time (rather than execution).

I really think there's something here in the planner that's still causing
headaches for people; I've seen this a few times now.

Looks to me like the common theme is:

  try_mergejoin_path
  -> initial_cost_mergejoin
  -> mergejoinscansel
  -> scalarineqsel
  -> ineq_histogram_selectivity
  -> get_actual_variable_range

From here it starts calling index_getnext(), which can go on for a very
long time, and the system seems to fall over if that begins to involve
much physical I/O. (A rough sketch of the kind of scenario I have in mind
is at the bottom of this mail.)

I'll continue to keep an eye out for this, and will keep this thread
updated if I find anything else that might move the understanding forward.

Thanks,
Jeremy Schneider

On 4/8/19 16:26, Andres Freund wrote:
> Hi,
>
> On 2019-04-08 16:10:17 -0700, Jeremy Schneider wrote:
>>
>> For what it's worth, I have seen a similar issue on Aurora PG 9.6 where
>> query planning took a very long time (multiple minutes). In this
>> particular case, there wasn't anything Aurora-specific about the call to
>> get_actual_variable_range. We weren't able to distinctly identify the
>> root cause or build a reproducible test case -- but we suspect that an
>> inefficiency might exist in community PostgreSQL code.
>>
>> For debugging, a few ideas:
>>
>> 1) capture a stack with pstack or perf record --call-graph
>> 2) capture the execution plan of the SQL w slow planning
>> 3) capture detailed stats for all relations and objects involved
>> 4) capture the usual info for bug reporting (preface section in docs)
>>
>> A reproducible test case is the gold standard; I'm keeping my eyes open
>> for another case too.
>>
>> For the slow planning case that I saw, the slow process was almost
>> entirely in this call stack (captured with perf record --call-graph):
>> ...
>> index_fetch_heap
>> index_getnext
>> get_actual_variable_range
>> ineq_histogram_selectivity
>> scalarineqsel
>> mergejoinscansel
>> initial_cost_mergejoin
>> try_mergejoin_path
>> add_paths_to_joinrel
>> make_join_rel
>> join_search_one_level
>> standard_join_search
>> make_one_rel
>> query_planner
>> ...
>
> I suspect some of this might be related to < 11 not having the following
> commit:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd

--
Jeremy Schneider
Database Engineer
Amazon Web Services
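
P.S. Here's a rough sketch of the scenario I have in mind. The table and
column names are made up, and this is not the workload from today's
incident - it's only meant to illustrate the pattern: a merge-joinable
column whose index still carries a large number of entries pointing at
dead heap tuples past the histogram boundary.

  -- hypothetical table, just to illustrate the shape of the problem
  create table t (id bigint primary key, payload text);
  insert into t select g, md5(g::text) from generate_series(1, 5000000) g;
  analyze t;

  -- delete a large range at the top end of the index and don't vacuum yet,
  -- so the btree still holds millions of entries pointing at dead tuples
  delete from t where id > 2500000;

  -- planning (not executing) a merge-joinable self-join makes
  -- mergejoinscansel() probe the extremes of t.id, and
  -- get_actual_variable_range() walks back from the stale maximum,
  -- fetching heap tuples to check visibility as it goes
  explain (summary, costs off) select * from t a join t b using (id);

How long that last step takes depends heavily on the server version (the
commit Andres linked changes the snapshot used for that scan) and on
whether vacuum or killed-index-tuple hints have already cleaned things up,
so treat this as an approximation of the pattern rather than a guaranteed
reproducer.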