On 6/17/23 23:42, nicolas paris wrote:
>>> My interpretation reading the query plan is: well sized small
>>> batches of upserts leverage the indexes, while the regular join
>>> chooses the sequential scan, including sorting and hashing, which
>>> takes forever and burns resources, including disk.
>>
>> You may be right, but it's hard to tell without seeing the query
>> plan.
>
> Here are parts of both plans:
>

I don't understand why you're sharing just a part of the plan and not
the whole thing, ideally including the actual update ... Giving us the
info in small pieces just means we have to guess and speculate.

> Bad case (strategy 2.1):
>
>   ->  Merge Left Join  (cost=530202629.03..255884257913.32
> rows=17023331531230 width=579)
>         Merge Cond: (david.list_id = ca.list_id)
>         ->  Sort  (cost=2019172.91..2024398.82 rows=2090361 width=569)
>               Sort Key: david.list_id
>               ->  Append  (cost=0.00..192152.41 rows=2090361 width=569)
>                     ->  Seq Scan on david_0 david_1
> (cost=0.00..1812.52 rows=20852 width=569)
>                     ->  Seq Scan on david_1 david_2
> (cost=0.00..1800.09 rows=20709 width=569)
>                     ->  Seq Scan on david_2 david_3
> (cost=0.00..1794.44 rows=20644 width=569)
>

Well, I kinda doubt you have 17023331531230 rows (that's not even
physically possible with a 2TB disk), so the estimate is immediately
suspicious. I'd bet the UPDATE ... FROM ... is missing a condition or
something like that, which results in a cartesian product.

> Good case (strategy 3):
>
>  Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
>    ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000
> width=575)
>          ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
>                ->  Index Scan using david_23_list_id_account_id_idx
> on david_23  (cost=0.29..6794.16 rows=19058 width=569)
>          ->  Index Scan using goliath_23_list_id_account_id_idx on
> goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
>                Index Cond: (list_id = david_23.list_id)
>
>>
>> Sounds very much like you'd benefit from tuning some cost parameters
>> to make the index scan look cheaper.
>>
>> Not sure what 'batched indexed join' would be, but it very much
>> sounds like a nested loop with an index scan.
>
> Agreed, a 2M-row nested loop over an index scan would likely work as
> well. Could tuning the cost parameters lead to such a large nested
> loop?
>

It should, but let's first see if there are other problems in the
query itself. If it's generating a cartesian product, tuning
parameters is pointless.

>> What PostgreSQL version are you using, what hardware? Did you tune
>> it in any way, or is everything just default?
>
> It is pg 15.3, on 2 cores / 8GO / 2TO SSDs, with the cloud provider's
> default parameters (RDS).
>

I assume 2TO is 2TB?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
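
PS: To illustrate what I mean by a missing condition (a hypothetical
sketch with made-up column names, I haven't seen your actual query):
both indexes in the good plan cover (list_id, account_id), yet the bad
plan joins on list_id alone. If list_id repeats on both sides, that
join is many-to-many and degenerates toward a cartesian product:

    -- joins on list_id alone: every david row matches every goliath
    -- row sharing that list_id ('payload' is a made-up column)
    UPDATE goliath ca
       SET payload = david.payload
      FROM david
     WHERE ca.list_id = david.list_id;

    -- with the full key the join is selective again
    UPDATE goliath ca
       SET payload = david.payload
      FROM david
     WHERE ca.list_id    = david.list_id
       AND ca.account_id = david.account_id;

For scale: 17023331531230 / 2090361 is about 8.1 million, i.e. the
planner expects each of the ~2M david rows to match ~8.1M goliath
rows - exactly the kind of blow-up a too-loose join condition
produces.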
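
As for making the index scan look cheaper: on SSD-backed storage the
usual first candidates are random_page_cost and effective_cache_size.
Something like the following (illustrative values only, worth trying
at the session level before touching the RDS parameter group):

    SET random_page_cost = 1.1;        -- default 4.0 assumes spinning disks
    SET effective_cache_size = '6GB';  -- roughly 75% of the 8GB of RAM

But again, fix the join first; no cost setting rescues a plan that has
to produce 10^13 rows.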