Re: Merge David and Goliath tables efficiently

nicolas paris <nicolas.paris@xxxxxxxxxx> · Sat, 17 Jun 2023 23:42:48 +0200

> > My interpretation reading the query plan is: well sized small
> > batches of upserts leverage the indexes while the regular join
> > choose the sequential scan, including sorting and hashing which
> > takes forever time and resources including disk.
> 
> You may be right, but it's hard to tell without seeing the query
> plan.

Here are part of both plans:

Bad case (strategy 2.1):

->  Merge Left Join  (cost=530202629.03..255884257913.32
rows=17023331531230 width=579)
Merge Cond: (david.list_id = ca.list_id)
->  Sort  (cost=2019172.91..2024398.82 rows=2090361 width=569)
      Sort Key: david.list_id
      ->  Append  (cost=0.00..192152.41 rows=2090361 width=569)
            ->  Seq Scan on david_0 david_1  (cost=0.00..1812.52
rows=20852 width=569)
            ->  Seq Scan on david_1 david_2  (cost=0.00..1800.09
rows=20709 width=569)
            ->  Seq Scan on david_2 david_3  (cost=0.00..1794.44
rows=20644 width=569)

Good case (strategy 3):

Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
  ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000
width=575)
        ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
              ->  Index Scan using david_23_list_id_account_id_idx on
david_23  (cost=0.29..6794.16 rows=19058 width=569)
        ->  Index Scan using goliath_23_list_id_account_id_idx on
goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
              Index Cond: (list_id = david_23.list_id)

> 
> Sounds very much like you'd benefit from tuning some cost parameters
> to
> make the index scan look cheaper.
> Not sure what 'batched indexed join' would be, but it very much
> sounds
> like a nested loop with an index scan.

Agreed, a 2M nested loop over index scan would likely work as well.
Would tuning the costs param could lead to get such large nested loop ?

> What PostgreSQL version are you using, what hardware? Did you tune it
> in
> any way, or is everything just default?

It is pg 15.3, on 2 cores / 8GO / 2TO ssds, with defaults cloud
provider parameters (RDS).