On 6/19/23 17:45, nicolas paris wrote:
>> But you wrote that in both cases the query is:
>
> that was indeed yet another typo, hope to do better in the future.
>
>
>> I'm willing to continue to investigate, but only if you prepare a
>> reproducer,
>
> Thanks for your starter script. Please find attached 2 scripts which
> now illustrate two troubles.
>
> repro1.sql is a slight evolution of yours. When I play with the david
> size (as described in the comments) you will see the plan going from
> nested loop to sequential scan. Also note that the partitionwise join
> is likely working. This illustrates my initial problem: the sequential
> scan is not going to work fine on my workload (david too large). How
> can I suggest postgres use a nested loop here? (I suspect playing with
> costs should help)
>

In general, this behavior is expected. The overall idea is that nested
loops are efficient for small row counts, but their cost rises quickly
exactly because they do a lot of random I/O (due to the index scan). At
some point it's cheaper to switch to a plan that does sequential I/O.
Which is exactly what's happening here - we switch from a plan doing a
lot of random I/O on goliath

                              QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=0.29..7888.69 rows=0 width=0)
     Merge on goliath_0 goliath_1
     ...
     Merge on goliath_99 goliath_100
     ->  Append  (cost=0.29..7888.69 rows=9998 width=47)
           ->  Nested Loop Left Join  (cost=0.29..93.89 rows=120 ...
                 ->  Seq Scan on david_0 david_1  (cost=0.00..2.20 ...
                 ->  Index Scan using goliath_0_pkey on goliath_0 ...
                       Index Cond: (id = david_1.id)
           ->  Nested Loop Left Join  (cost=0.29..77.10 rows=98 ...
                 ->  Seq Scan on david_1 david_2  (cost=0.00..1.98 ...
                 ->  Index Scan using goliath_1_pkey on goliath_1 ...
                       Index Cond: (id = david_2.id)
           ...
           ->  Nested Loop Left Join  (cost=0.29..74.58 rows=95 ...
                 ->  Seq Scan on david_99 david_100  (cost=0.00..1.95 ...
                 ->  Index Scan using goliath_99_pkey on goliath_99 ...
                       Index Cond: (id = david_100.id)
  (502 rows)

to a plan that does a lot of sequential I/O

                              QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=293.44..264556.47 rows=0 width=0)
     Merge on goliath_0 goliath_1
     ...
     Merge on goliath_99 goliath_100
     ->  Append  (cost=293.44..264556.47 rows=951746 width=47)
           ->  Hash Right Join  (cost=293.44..2597.05 rows=9486 ...
                 Hash Cond: (goliath_1.id = david_1.id)
                 ->  Seq Scan on goliath_0 goliath_1  (cost=0.00..
                 ->  Hash  (cost=174.86..174.86 rows=9486 width=37)
                       ->  Seq Scan on david_0 david_1  (cost=0.00..
           ->  Hash Right Join  (cost=295.62..2613.90 rows=9583 width=
                 Hash Cond: (goliath_2.id = david_2.id)
                 ->  Seq Scan on goliath_1 goliath_2  (cost=0.00..1845
                 ->  Hash  (cost=175.83..175.83 rows=9583 width=37)
                       ->  Seq Scan on david_1 david_2  (cost=0.00..
           ...
           ->  Hash Right Join  (cost=288.33..2593.16 rows=9348 width
                 Hash Cond: (goliath_100.id = david_100.id)
                 ->  Seq Scan on goliath_99 goliath_100  (cost=0.00..
                 ->  Hash  (cost=171.48..171.48 rows=9348 width=37)
                       ->  Seq Scan on david_99 david_100  (cost=0.00..
  (602 rows)

That's expected, because the cost if I force the Nested Loop with the
higher row count in "david" looks like this:

                              QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=0.29..331253.00 rows=0 width=0)
   ...

Of course, the question is at which point the switch should happen. You
can try setting enable_hashjoin=off, which should push the optimizer to
use the first plan. If it does, you'll know the cost difference between
the two plans. If you run it and it's actually faster than the "default"
plan with a hashjoin/seqscans, you can try lowering random_page_cost,
which is likely the main input. The default is 4, and in general it
should be higher than seq_page_cost=1.0 (because random I/O is more
expensive). In any case, there's a point when the nested loops get
terrible. I mean, imagine "david" has 100000 rows, and "goliath" has
100000 pages (800MB).
It's just cheaper to seqscan those 100k pages than to randomly scan the
same 100k pages. You can tune where the plan flips, ofc.

> repro2.sql: now I changed the table layout (similar to my setup) to
> reproduce the partitionwise join which does not trigger. I added a
> partition column, and a unique index to be able to mimic a primary
> key. Now partitionwise join (in my local docker vanilla postgres 15.3)
> does not work. Eventually, if I do small batches, then the merge works
> fast. That's another, unrelated problem.
>

This is absolutely expected. If you partition by hash (id, part_col),
you can't join on (id) alone and expect partitionwise join to work. To
quote the enable_partitionwise_join documentation [1]:

  Enables or disables the query planner's use of partitionwise join,
  which allows a join between partitioned tables to be performed by
  joining the matching partitions. Partitionwise join currently applies
  only when the join conditions include all the partition keys, which
  must be of the same data type and have one-to-one matching sets of
  child partitions.

So the fact that

  merge into goliath using david on david.id = goliath.id
  when matched then update set val = david.val
  when not matched then insert (id, val) values (david.id, david.val);

does not work is absolutely expected. You need to join on part_col too.
This is because the partition is determined by a hash of both columns, a
bit like md5(id+part_col), so knowing just the "id" is not enough to
determine the partition.

Even if you fix that, the last query won't use a partitionwise join
because of the ORDER BY subquery, which acts as an optimization fence
(which means the merge does not actually see that the underlying table
is partitioned). If you get rid of that and add part_col to the join,
it translates to the first issue - setting costs so the plan flips to
the sequential scan at the right point.
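Putting both fixes together, a rough sketch (assuming the table and
column names from the repro scripts - goliath and david partitioned by
HASH (id, part_col), with val as the payload column):

```sql
-- Sketch only; names (goliath, david, id, part_col, val) are taken
-- from the repro scripts and may differ in the real workload.
SET enable_partitionwise_join = on;

-- Fix for repro2: join on *all* partition keys, and merge the table
-- directly rather than through an ORDER BY subquery (which acts as an
-- optimization fence).
MERGE INTO goliath
USING david
  ON david.id = goliath.id AND david.part_col = goliath.part_col
WHEN MATCHED THEN UPDATE SET val = david.val
WHEN NOT MATCHED THEN
  INSERT (id, part_col, val) VALUES (david.id, david.part_col, david.val);

-- Fix for repro1: if EXPLAIN ANALYZE shows the nested-loop plan is
-- actually faster, make random I/O look cheaper (reasonable on SSDs)
-- instead of disabling hash joins outright.
SET random_page_cost = 1.5;   -- default 4.0; keep it >= seq_page_cost

-- For a one-off comparison of the two plans:
-- SET enable_hashjoin = off;  -- then EXPLAIN the MERGE again
```

The SET commands affect only the current session, so they're safe to
experiment with before changing anything in postgresql.conf.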
[1] https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-JOIN -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company