On 6/19/23 17:45, nicolas paris wrote:
>> But you wrote that in both cases the query is:
>
> that was indeed yet another typo, hope to do better in the future.
>
>
>> I'm willing to continue to investigate, but only if you prepare a
>> reproducer,
>
> Thanks for your starter script. Please find attached 2 scripts which
> now illustrate two troubles.
>
> repro1.sql is a slight evolution of yours. When I play with the david
> size (as described in the comments) you will see the plan going from
> nested loop to sequential scan. Also note that the partitionwise join
> is likely working. This illustrates my initial problem: the sequential
> scan is not going to work fine on my workload (david too large). How
> can I suggest postgres use a nested loop here? (I suspect playing with
> costs should help)
>

In general, this behavior is expected. The overall idea is that nested
loops are efficient for small row counts, but their cost rises quickly
exactly because they do a lot of random I/O (due to the index scan). At
some point it's cheaper to switch to a plan that does sequential I/O.
Which is exactly what's happening here - we switch from a plan doing a
lot of random I/O on goliath

                              QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=0.29..7888.69 rows=0 width=0)
     Merge on goliath_0 goliath_1
     ...
     Merge on goliath_99 goliath_100
     ->  Append  (cost=0.29..7888.69 rows=9998 width=47)
           ->  Nested Loop Left Join  (cost=0.29..93.89 rows=120 ...
                 ->  Seq Scan on david_0 david_1  (cost=0.00..2.20 ...
                 ->  Index Scan using goliath_0_pkey on goliath_0 ...
                       Index Cond: (id = david_1.id)
           ->  Nested Loop Left Join  (cost=0.29..77.10 rows=98 ...
                 ->  Seq Scan on david_1 david_2  (cost=0.00..1.98 ...
                 ->  Index Scan using goliath_1_pkey on goliath_1 ...
                       Index Cond: (id = david_2.id)
           ...
           ->  Nested Loop Left Join  (cost=0.29..74.58 rows=95 ...
                 ->  Seq Scan on david_99 david_100  (cost=0.00..1.95 ...
                 ->  Index Scan using goliath_99_pkey on goliath_99 ...
                       Index Cond: (id = david_100.id)
  (502 rows)

to a plan that does a lot of sequential I/O

                              QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=293.44..264556.47 rows=0 width=0)
     Merge on goliath_0 goliath_1
     ...
     Merge on goliath_99 goliath_100
     ->  Append  (cost=293.44..264556.47 rows=951746 width=47)
           ->  Hash Right Join  (cost=293.44..2597.05 rows=9486 ...
                 Hash Cond: (goliath_1.id = david_1.id)
                 ->  Seq Scan on goliath_0 goliath_1  (cost=0.00..
                 ->  Hash  (cost=174.86..174.86 rows=9486 width=37)
                       ->  Seq Scan on david_0 david_1  (cost=0.00..
           ->  Hash Right Join  (cost=295.62..2613.90 rows=9583 width=
                 Hash Cond: (goliath_2.id = david_2.id)
                 ->  Seq Scan on goliath_1 goliath_2  (cost=0.00..1845
                 ->  Hash  (cost=175.83..175.83 rows=9583 width=37)
                       ->  Seq Scan on david_1 david_2  (cost=0.00..
           ...
           ->  Hash Right Join  (cost=288.33..2593.16 rows=9348 width
                 Hash Cond: (goliath_100.id = david_100.id)
                 ->  Seq Scan on goliath_99 goliath_100  (cost=0.00..
                 ->  Hash  (cost=171.48..171.48 rows=9348 width=37)
                       ->  Seq Scan on david_99 david_100  (cost=0.00..
  (602 rows)

That's expected, because the cost if I force the Nested Loop with the
higher row count in "david" looks like this:

                              QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=0.29..331253.00 rows=0 width=0)
   ...

Of course, the question is at which point the switch should happen. You
can try setting enable_hashjoin=off, which should push the optimizer to
use the first plan. If it does, you'll know the cost difference between
the two plans. If you run it and it's actually faster than the "default"
plan with a hashjoin/seqscans, you can try lowering random_page_cost,
which is likely the main input. The default is 4, and in general it
should be higher than seq_page_cost=1.0 (because random I/O is more
expensive). In any case, there's a point when the nested loops get
terrible. I mean, imagine "david" has 100000 rows, and "goliath" has
100000 pages (800MB).
It's just cheaper to seqscan those 100k pages than to randomly scan the
same 100k pages. You can tune where the plan flips, ofc.

> repro2.sql: now I changed the table layout (similar to my setup) to
> reproduce the partitionwise join which does not trigger. I added a
> partition column, and a unique index to be able to mimic a primary
> key. Now partitionwise join (in my local docker vanilla postgres 15.3)
> does not work. Eventually, if I do small batches, then the merge works
> fast. That's another, unrelated problem.
>

This is absolutely expected. If you partition by hash (id, part_col),
you can't join on (id) alone and expect partitionwise join to work. To
quote the enable_partitionwise_join documentation [1]:

  Enables or disables the query planner's use of partitionwise join,
  which allows a join between partitioned tables to be performed by
  joining the matching partitions. Partitionwise join currently applies
  only when the join conditions include all the partition keys, which
  must be of the same data type and have one-to-one matching sets of
  child partitions.

So the fact that

  merge into goliath using david on david.id = goliath.id
  when matched then update set val = david.val
  when not matched then insert (id, val) values (david.id, david.val);

does not work is absolutely expected. You need to join on part_col too.
This is because the partition is determined by a hash of both columns, a
bit like md5(id+part_col), so knowing just the "id" is not enough to
determine the partition.

Even if you fix that, the last query won't use a partitionwise join
because of the ORDER BY subquery, which acts as an optimization fence
(which means the merge does not actually see that the underlying table
is partitioned). If you get rid of that and add part_col to the join,
it translates to the first issue - setting costs so the plan flips to
the sequential scan at the right point.
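Putting both fixes together, a rough sketch (assuming the table and
column names from the repro scripts - goliath and david partitioned by
HASH (id, part_col), with val as the payload column):

```sql
-- Sketch only; names (goliath, david, id, part_col, val) are taken
-- from the repro scripts and may differ in the real workload.
SET enable_partitionwise_join = on;

-- Fix for repro2: join on *all* partition keys, and merge the table
-- directly rather than through an ORDER BY subquery (which acts as an
-- optimization fence).
MERGE INTO goliath
USING david
  ON david.id = goliath.id AND david.part_col = goliath.part_col
WHEN MATCHED THEN UPDATE SET val = david.val
WHEN NOT MATCHED THEN
  INSERT (id, part_col, val) VALUES (david.id, david.part_col, david.val);

-- Fix for repro1: if EXPLAIN ANALYZE shows the nested-loop plan is
-- actually faster, make random I/O look cheaper (reasonable on SSDs)
-- instead of disabling hash joins outright.
SET random_page_cost = 1.5;   -- default 4.0; keep it >= seq_page_cost

-- For a one-off comparison of the two plans:
-- SET enable_hashjoin = off;  -- then EXPLAIN the MERGE again
```

The SET commands affect only the current session, so they're safe to
experiment with before changing anything in postgresql.conf.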
[1] https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-JOIN -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company