Two years later, I still remember this. And today I confirmed someone
hitting this on open source PG13.

In the incident I saw today, the tuple reads during planning were leading
to excessive multixact (mxid) waits - I suspect memory pressure caused the
mxid offsets/members data to get evicted from cache, so lookups started
hitting disk. It was the mxid waits that actually made the system go
sideways - I was just surprised that we saw most of the system CPU in
planning time (rather than execution).

I really think there's something here in the planner that's still causing
headaches for people; I've seen this a few times now.

Looks to me like the common theme is:

  try_mergejoin_path
  -> initial_cost_mergejoin
  -> mergejoinscansel
  -> scalarineqsel
  -> ineq_histogram_selectivity
  -> get_actual_variable_range

From here it starts calling index_getnext(), which can go on for a very
long time, and the system seems to fall over if that begins to involve
much physical I/O. (A rough sketch of the kind of scenario I have in mind
is at the bottom of this mail.)

I'll continue to keep an eye out for this, and will keep this thread
updated if I find anything else that might move the understanding forward.

Thanks,
Jeremy Schneider

On 4/8/19 16:26, Andres Freund wrote:
> Hi,
>
> On 2019-04-08 16:10:17 -0700, Jeremy Schneider wrote:
>>
>> For what it's worth, I have seen a similar issue on Aurora PG 9.6 where
>> query planning took a very long time (multiple minutes). In this
>> particular case, there wasn't anything Aurora-specific about the call to
>> get_actual_variable_range. We weren't able to distinctly identify the
>> root cause or build a reproducible test case -- but we suspect that an
>> inefficiency might exist in community PostgreSQL code.
>>
>> For debugging, a few ideas:
>>
>> 1) capture a stack with pstack or perf record --call-graph
>> 2) capture the execution plan of the SQL w slow planning
>> 3) capture detailed stats for all relations and objects involved
>> 4) capture the usual info for bug reporting (preface section in docs)
>>
>> A reproducible test case is the gold standard; I'm keeping my eyes open
>> for another case too.
>>
>> For the slow planning case that I saw, the slow process was almost
>> entirely in this call stack (captured with perf record --call-graph):
>> ...
>> index_fetch_heap
>> index_getnext
>> get_actual_variable_range
>> ineq_histogram_selectivity
>> scalarineqsel
>> mergejoinscansel
>> initial_cost_mergejoin
>> try_mergejoin_path
>> add_paths_to_joinrel
>> make_join_rel
>> join_search_one_level
>> standard_join_search
>> make_one_rel
>> query_planner
>> ...
>
> I suspect some of this might be related to < 11 not having the following
> commit:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd

--
Jeremy Schneider
Database Engineer
Amazon Web Services
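
P.S. Here's a rough sketch of the scenario I have in mind. The table and
column names are made up, and this is not the workload from today's
incident - it's only meant to illustrate the pattern: a merge-joinable
column whose index still carries a large number of entries pointing at
dead heap tuples past the histogram boundary.

  -- hypothetical table, just to illustrate the shape of the problem
  create table t (id bigint primary key, payload text);
  insert into t select g, md5(g::text) from generate_series(1, 5000000) g;
  analyze t;

  -- delete a large range at the top end of the index and don't vacuum yet,
  -- so the btree still holds millions of entries pointing at dead tuples
  delete from t where id > 2500000;

  -- planning (not executing) a merge-joinable self-join makes
  -- mergejoinscansel() probe the extremes of t.id, and
  -- get_actual_variable_range() walks back from the stale maximum,
  -- fetching heap tuples to check visibility as it goes
  explain (summary, costs off) select * from t a join t b using (id);

How long that last step takes depends heavily on the server version (the
commit Andres linked changes the snapshot used for that scan) and on
whether vacuum or killed-index-tuple hints have already cleaned things up,
so treat this as an approximation of the pattern rather than a guaranteed
reproducer.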