Re: Fwd: temp_file_limit?

Hi Justin,

As our queries are generated, I decided to create a peephole-optimizer kind of pass that scans the generated SQL AST for multiple conditions on the same table reference. I can then use our metadata to see whether these references are expected to be correlated. This creates about 20 statistics sets, including the one you indicated. It at least makes the problematic query have a stable and very fast plan (so far). I had hoped for some more improvement in other queries, but that has not yet been evident ;)
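
As a minimal sketch of what the pass emits (table and column names are made up), each detected combination becomes a statistics object, followed by a re-analyze so the planner picks it up:

| CREATE STATISTICS stx_orders_region_country (dependencies)
|     ON region, country FROM orders;
| ANALYZE orders;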

Thanks a lot for the tips and your help!

Cordially,

Frits

On Tue, Dec 20, 2022 at 10:11 PM Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
On Mon, Dec 19, 2022 at 09:10:27PM +0100, Frits Jalvingh wrote:
> @justin
>
> I tried the create statistics variant and that definitely improves the
> estimate, and with that one of the "bad" cases (the one with the 82 minute
> plan) now creates a good plan using only a few seconds.
> That is a worthwhile path to follow. A bit hard to do, because those
> conditions can be anything, but I can probably calculate the ones used per
> customer and create those correlation statistics from that... It is
> definitely better than tweaking the "poor man's query hints" enable_xxxx
> 8-/ which is really not helping with plan stability either.
>
> That will be a lot of work, but I'll let you know the results ;)

Yeah, if the conditions are arbitrary, then it's going to be more
difficult.  Hopefully you don't have too many columns.  :)

I suggest enabling auto_explain and monitoring for queries which were
slow, and retroactively adding statistics to those columns which are
most commonly queried together, and which have correlations (which the
planner doesn't otherwise know about).
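
A minimal setup for that could look like this (the 30s threshold is just
an illustrative choice):

| LOAD 'auto_explain';  -- makes the auto_explain.* settings known in this session
| ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
| ALTER SYSTEM SET auto_explain.log_min_duration = '30s';
| SELECT pg_reload_conf();  -- new sessions will log plans of queries slower than 30s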

You won't want to have more than a handful of columns in a stats object
(since it requires factorial(N) complexity), but you can have multiple
stats objects with different combinations of columns (and, in v14,
expressions).  You can also set a lower stats target to make the cost a
bit lower.
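
For example (hypothetical names), several small objects rather than one
wide one, with a reduced target on one of them:

| CREATE STATISTICS stx_cust_region (dependencies) ON customer_id, region FROM orders;
| CREATE STATISTICS stx_region_country (dependencies) ON region, country FROM orders;
| ALTER STATISTICS stx_region_country SET STATISTICS 100;  -- lower target => cheaper ANALYZE (v13+)
| ANALYZE orders;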

You could try to check which columns are correlated, either by running
something like this for different combinations of columns:
| SELECT COUNT(1), col1, col2 FROM tbl GROUP BY 2, 3 ORDER BY 1;
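
A more direct variant of that check (again with hypothetical names): if
the number of distinct (col1, col2) pairs is close to the number of
distinct col1 values, then col2 is largely determined by col1:

| SELECT COUNT(DISTINCT col1) AS n_col1,
|        COUNT(DISTINCT col2) AS n_col2,
|        COUNT(DISTINCT (col1, col2)) AS n_pairs  -- n_pairs ~ n_col1 means col2 depends on col1
| FROM tbl;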

Or by creating a tentative/experimental stats object on a handful of
columns at a time for which you have an intuition about their
correlation, and then checking the calculated dependencies in
pg_stats_ext (values close to 1.0 indicate a strong functional
dependency).  You may need to do something clever to use that for
arbitrary columns.  Maybe this is a start.
| SELECT dep.value::float, tablename, attnames, dep.key, exprs
| FROM (SELECT (json_each_text(dependencies::text::json)).*, *
|       FROM pg_stats_ext) dep
| WHERE dependencies IS NOT NULL
| ORDER BY 1 DESC;
| -- optionally add: AND regexp_count(dep.key, ',') < 2

--
Justin
