Hello
We have an application (https://dhis2.org) which has been using postgresql as a backend for the past 15 years or so. Gradually moving through pg versions 8,9,10 etc as the years went by. At the moment a large number of our implementations are using versions 13, 14 and 15. Unfortunately we have recently discovered that, despite most operations performing considerably better on later versions, there is a particular type of query that is very much slower (up to 100x) than it was on postgresql 11. We have seen this regression in 13, 14 and 15. Unfortunately I dont have stats on version 12 yet.
The query is not beautifully crafted. It is automatically generated from a custom _expression_ language. We know that it can probably be improved, but at the moment we would really like to know if there is anything we can configure with the SQL as-is to get performance like we had back on pg11.
The example below is a typical such query. I've attached below that, links to the results of EXPLAIN (ANALYZE, BUFFERS). for pg11 and pg15 on the same physical environment loaded with the same database. I would appreciate some help trying to understand what we are seeing with the EXPLAIN output and whether there is anything to be done.
EXPLAIN ANALYZE
select
count(pi) as value,
'2022W21' as Weekly
from
analytics_enrollment_gr3uwzvzpqt as ax
where
cast(
(
select
"IEMtgZapP2s"
from
analytics_event_gr3uWZVzPQT
where
analytics_event_gr3uWZVzPQT.pi = ax.pi
and "IEMtgZapP2s" is not null
and ps = 'XO45JBGJcXJ'
order by
executiondate desc
limit
1
) as date
) < cast('2022-05-30' as date)
and cast(
(
select
"IEMtgZapP2s"
from
analytics_event_gr3uWZVzPQT
where
analytics_event_gr3uWZVzPQT.pi = ax.pi
and "IEMtgZapP2s" is not null
and ps = 'XO45JBGJcXJ'
order by
executiondate desc
limit
1
) as date
) >= cast('2022-05-23' as date)
and (uidlevel1 = 'Plmg8ikyfrK')
and (
coalesce(
(
select
"QEbYS2QOXLf"
from
analytics_event_gr3uWZVzPQT
where
analytics_event_gr3uWZVzPQT.pi = ax.pi
and "QEbYS2QOXLf" is not null
and ps = 'XO45JBGJcXJ'
order by
executiondate desc
limit
1
):: text,
''
) = 'FIN_CASE_CLASS_CONFIRM_LAB'
)
limit 1;
-- limit 200001;
select
count(pi) as value,
'2022W21' as Weekly
from
analytics_enrollment_gr3uwzvzpqt as ax
where
cast(
(
select
"IEMtgZapP2s"
from
analytics_event_gr3uWZVzPQT
where
analytics_event_gr3uWZVzPQT.pi = ax.pi
and "IEMtgZapP2s" is not null
and ps = 'XO45JBGJcXJ'
order by
executiondate desc
limit
1
) as date
) < cast('2022-05-30' as date)
and cast(
(
select
"IEMtgZapP2s"
from
analytics_event_gr3uWZVzPQT
where
analytics_event_gr3uWZVzPQT.pi = ax.pi
and "IEMtgZapP2s" is not null
and ps = 'XO45JBGJcXJ'
order by
executiondate desc
limit
1
) as date
) >= cast('2022-05-23' as date)
and (uidlevel1 = 'Plmg8ikyfrK')
and (
coalesce(
(
select
"QEbYS2QOXLf"
from
analytics_event_gr3uWZVzPQT
where
analytics_event_gr3uWZVzPQT.pi = ax.pi
and "QEbYS2QOXLf" is not null
and ps = 'XO45JBGJcXJ'
order by
executiondate desc
limit
1
):: text,
''
) = 'FIN_CASE_CLASS_CONFIRM_LAB'
)
limit 1;
-- limit 200001;
The EXPLAIN result for postgresql 11 is here: https://explain.depesz.com/s/3QfC
The same query on postgresql 15 is here: https://explain.depesz.com/s/BzpA#html
Whereas the first example takes 23s, the pg15 one takes 243s (this time difference is even more stark when you remove BUFFERS from the explain). During execution the pg15 query consumes 100% of a CPU core throughout indicating it is probably cpu bound rather than IO.
The plan selected in both cases seems to be exactly the same. But pg15 seems to make a lot of work of the final aggregation step. Anecdotally I understand that the same difference is there with pg13 and 14. The only significant factor I could think of relating to new behaviour in pg13 is the new hash_mem_multiplier configuration and it its relation to work_mem availbale for hash tables. I have attempted to turn up both hash_mem_multilier and work_mem to ridiculous values and I see no change whatsoever on pg15.
I also removed the LIMIT and tested again with no significant difference: https://explain.depesz.com/s/K9Lq
Does anyone have a theory of why pg15 should behave so differently to pg11 here? Better still, any suggestions for configuration that might make pg15 behave more like pg10. I am really dreading the prospect of stepping our many live implementations back to pg11 :-(.
Regards
Bob