On 6/17/23 23:42, nicolas paris wrote:
>>> My interpretation reading the query plan is: well sized small
>>> batches of upserts leverage the indexes, while the regular join
>>> chooses the sequential scan, including sorting and hashing, which
>>> takes forever and burns resources, including disk.
>>
>> You may be right, but it's hard to tell without seeing the query
>> plan.
>
> Here are parts of both plans:
>

I don't understand why you're sharing just a part of the plan and not
the whole thing, ideally including the actual update ... Giving us the
info in small pieces just means we have to guess and speculate.

> Bad case (strategy 2.1):
>
>   ->  Merge Left Join  (cost=530202629.03..255884257913.32
> rows=17023331531230 width=579)
>         Merge Cond: (david.list_id = ca.list_id)
>         ->  Sort  (cost=2019172.91..2024398.82 rows=2090361 width=569)
>               Sort Key: david.list_id
>               ->  Append  (cost=0.00..192152.41 rows=2090361 width=569)
>                     ->  Seq Scan on david_0 david_1
> (cost=0.00..1812.52 rows=20852 width=569)
>                     ->  Seq Scan on david_1 david_2
> (cost=0.00..1800.09 rows=20709 width=569)
>                     ->  Seq Scan on david_2 david_3
> (cost=0.00..1794.44 rows=20644 width=569)
>

Well, I kinda doubt you have 17023331531230 rows (that's not even
physically possible with a 2TB disk), so the estimate is immediately
suspicious. I'd bet the UPDATE ... FROM ... is missing a condition or
something like that, which results in a cartesian product.

> Good case (strategy 3):
>
>  Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
>    ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000
> width=575)
>          ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
>                ->  Index Scan using david_23_list_id_account_id_idx
> on david_23  (cost=0.29..6794.16 rows=19058 width=569)
>          ->  Index Scan using goliath_23_list_id_account_id_idx on
> goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
>                Index Cond: (list_id = david_23.list_id)
>
>>
>> Sounds very much like you'd benefit from tuning some cost parameters
>> to make the index scan look cheaper.
>>
>> Not sure what 'batched indexed join' would be, but it very much
>> sounds like a nested loop with an index scan.
>
> Agreed, a 2M-row nested loop over an index scan would likely work as
> well. Could tuning the cost parameters lead to such a large nested
> loop?
>

It should, but let's first see if there are other problems in the
query itself. If it's generating a cartesian product, tuning
parameters is pointless.

>> What PostgreSQL version are you using, what hardware? Did you tune
>> it in any way, or is everything just default?
>
> It is pg 15.3, on 2 cores / 8GO / 2TO SSDs, with the cloud provider's
> default parameters (RDS).
>

I assume 2TO is 2TB?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
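
PS: To illustrate what I mean by a missing condition (a hypothetical
sketch with made-up column names, I haven't seen your actual query):
both indexes in the good plan cover (list_id, account_id), yet the bad
plan joins on list_id alone. If list_id repeats on both sides, that
join is many-to-many and degenerates toward a cartesian product:

    -- joins on list_id alone: every david row matches every goliath
    -- row sharing that list_id ('payload' is a made-up column)
    UPDATE goliath ca
       SET payload = david.payload
      FROM david
     WHERE ca.list_id = david.list_id;

    -- with the full key the join is selective again
    UPDATE goliath ca
       SET payload = david.payload
      FROM david
     WHERE ca.list_id    = david.list_id
       AND ca.account_id = david.account_id;

For scale: 17023331531230 / 2090361 is about 8.1 million, i.e. the
planner expects each of the ~2M david rows to match ~8.1M goliath
rows - exactly the kind of blow-up a too-loose join condition
produces.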
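
As for making the index scan look cheaper: on SSD-backed storage the
usual first candidates are random_page_cost and effective_cache_size.
Something like the following (illustrative values only, worth trying
at the session level before touching the RDS parameter group):

    SET random_page_cost = 1.1;        -- default 4.0 assumes spinning disks
    SET effective_cache_size = '6GB';  -- roughly 75% of the 8GB of RAM

But again, fix the join first; no cost setting rescues a plan that has
to produce 10^13 rows.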