Hi all,Working on the emaj extension (for the curious ones, https://emaj.readthedocs.io/en/latest/ and https://github.com/dalibo/emaj), I recently faced a performance problem when querying and aggregating data changes. A query with 3 CTE has a O^2 behavior (https://explain.dalibo.com/plan/1ded242d4ebf3gch#plan). I have found a workaround by setting enable_nestloop to FALSE. But this has drawbacks. So I want to better understand the issue.
During my analysis, I realized that the output rows estimate of the second CTE is really bad, leading to a bad plan for the next CTE.
I reproduced the issue in a very small test case with a simplified query. Attached is a shell script and its output.
A simple table is created, filled and analyzed. The simplified statement is: WITH keys AS ( SELECT c1, min(seq) AS seq FROM perf GROUP BY c1 ) SELECT tbl.* FROM perf tbl JOIN keys ON (keys.c1 = tbl.c1 AND keys.seq = tbl.seq); Its plan is:Hash Join (cost=958.00..1569.00 rows=1 width=262) (actual time=18.516..30.702 rows=10000 loops=1)
Output: tbl.c1, tbl.seq, tbl.c2 Inner Unique: true Hash Cond: ((tbl.c1 = perf.c1) AND (tbl.seq = (min(perf.seq)))) Buffers: shared hit=856-> Seq Scan on public.perf tbl (cost=0.00..548.00 rows=12000 width=262) (actual time=0.007..2.323 rows=12000 loops=1)
Output: tbl.c1, tbl.seq, tbl.c2 Buffers: shared hit=428-> Hash (cost=808.00..808.00 rows=10000 width=8) (actual time=18.480..18.484 rows=10000 loops=1)
Output: perf.c1, (min(perf.seq)) Buckets: 16384 Batches: 1 Memory Usage: 519kB Buffers: shared hit=428-> HashAggregate (cost=608.00..708.00 rows=10000 width=8) (actual time=10.688..14.321 rows=10000 loops=1)
Output: perf.c1, min(perf.seq) Group Key: perf.c1 Batches: 1 Memory Usage: 1425kB Buffers: shared hit=428-> Seq Scan on public.perf (cost=0.00..548.00 rows=12000 width=8) (actual time=0.002..2.330 rows=12000 loops=1)
Output: perf.c1, perf.seq, perf.c2 Buffers: shared hit=428 It globally looks good to me, with 2 sequential scans and a hash join.But the number of returned rows estimate is always 1, while it actually depends on the data content (here 10000).
For the hash join node, the plan shows a "Inner Unique: true" property. I wonder if this is normal. It look likes the optimizer doesn't take into account the presence of the GROUP BY clause in its estimate.
I reproduce the case with all supported postgres versions. Thanks by advance for any explanation. Philippe.
Attachment:
test_case.sh
Description: application/shellscript
select version(); version ------------------------------------------------------------------------------------------------------- PostgreSQL 15.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit (1 row) \timing on Timing is on. \set ON_ERROR_STOP SET client_min_messages TO WARNING; SET Time: 0,197 ms --> create structures create table perf ( c1 integer not null, seq serial not null, c2 text ); CREATE TABLE Time: 10,592 ms --> populate the table insert into perf (c1, c2) select i, rpad('2',300) from generate_series (1, 1000*:p_scaleFactor) i; INSERT 0 10000 Time: 49,318 ms --> perform changes insert into perf (c1,c2) select c1, 'updated' from perf where c1 % 5 = 0; INSERT 0 2000 Time: 34,606 ms --> vacuum and analyze select count(*) from perf; count ------- 12000 (1 row) Time: 4,693 ms vacuum analyze perf; VACUUM Time: 28,937 ms --> bad estimate of the number of output rows (always 1) explain (analyze, buffers, verbose) WITH keys AS ( SELECT c1, min(seq) AS seq FROM perf GROUP BY c1 ) SELECT tbl.* FROM perf tbl JOIN keys ON (keys.c1 = tbl.c1 AND keys.seq = tbl.seq) ; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=958.00..1569.00 rows=1 width=262) (actual time=14.902..24.311 rows=10000 loops=1) Output: tbl.c1, tbl.seq, tbl.c2 Inner Unique: true Hash Cond: ((tbl.c1 = perf.c1) AND (tbl.seq = (min(perf.seq)))) Buffers: shared hit=856 -> Seq Scan on public.perf tbl (cost=0.00..548.00 rows=12000 width=262) (actual time=0.004..1.813 rows=12000 loops=1) Output: tbl.c1, tbl.seq, tbl.c2 Buffers: shared hit=428 -> Hash (cost=808.00..808.00 rows=10000 width=8) (actual time=14.880..14.881 rows=10000 loops=1) Output: perf.c1, (min(perf.seq)) Buckets: 16384 Batches: 1 Memory Usage: 519kB Buffers: shared hit=428 -> HashAggregate (cost=608.00..708.00 rows=10000 width=8) (actual time=8.468..11.305 rows=10000 loops=1) Output: perf.c1, min(perf.seq) Group Key: perf.c1 Batches: 1 Memory Usage: 1425kB Buffers: shared hit=428 -> Seq Scan on public.perf (cost=0.00..548.00 rows=12000 width=8) (actual time=0.001..1.833 rows=12000 loops=1) Output: perf.c1, perf.seq, perf.c2 Buffers: shared hit=428 Planning: Buffers: shared hit=20 Planning Time: 0.197 ms Execution Time: 25.242 ms (24 rows) Time: 26,096 ms