Lok:

On Sat, 3 Feb 2024 at 21:44, Lok P <loknath.73@xxxxxxxxx> wrote:
> On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte <folarte@xxxxxxxxxxxxxx> wrote:
> However, as we have ~5 billion rows in the base table, and out of that we were expecting almost half, i.e. ~2 billion, would be duplicates. And you said doing the inserts using the "order by CTID offset" approach must cause one full sequential scan of the whole table for loading each chunk/10M of rows, and that would take a long time, I believe.

I did not say "MUST CAUSE". In fact, I said I believe it would not. What I am going to say (now) is: test it. Make 1k, 10k, 100k, and 1M row tables in a scratch database, and EXPLAIN and test your queries there with and without indexes, etc. Not all of those are needed, but testing 100k and 1M in 1k batches could reveal hidden quadratic behaviour, and EXPLAIN would show unexpected sorts or scans.

> I am still trying to understand the other approach which you suggested. Not able to understand "you can select where index_col > last order by index_col limit 10M".
> However, to get the max ID value of the last 10M loaded rows in target, do you say that having a PK index created on that target table column (ID) will help, and we can save the max(ID) value subsequently in another table to fetch and keep loading from the source table (as ID > Max_ID stored in temp table)?

I am a programmer by trade. When faced with problems like these, unless I find a trivial solution, I tend to write auxiliary programs, as it is much easier for me to write a thousand lines of Perl (or Python/C/C++/Java), or a couple of hundred lines of plpgsql (if I want to avoid round trips), than to debug complicated SQL-only workflows. For your kind of problem I would make a program to read the rows and insert them.
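To make the "select where index_col > last" idea concrete, here is a minimal Python sketch of keyset pagination. An in-memory list stands in for the source table and a list comprehension stands in for the indexed query; names and the chunk size are illustrative, not from any real schema:

```python
# Keyset pagination sketch: fetch rows in chunks ordered by id,
# remembering the last id seen so each chunk resumes where the
# previous one ended. fetch_chunk() simulates
#   SELECT ... WHERE id > last ORDER BY id LIMIT chunk_size
# against an already-sorted row list.

def fetch_chunk(sorted_rows, last_id, chunk_size):
    """Simulate one keyset query against rows sorted by id."""
    return [r for r in sorted_rows if r["id"] > last_id][:chunk_size]

def load_in_chunks(source_rows, chunk_size=3):
    sorted_rows = sorted(source_rows, key=lambda r: r["id"])
    loaded = []
    last_id = -1  # smaller than any real id, so the first chunk starts at the top
    while True:
        chunk = fetch_chunk(sorted_rows, last_id, chunk_size)
        if not chunk:
            break
        loaded.extend(chunk)          # in SQL this would be the INSERT
        last_id = chunk[-1]["id"]     # resume point for the next chunk
    return loaded

rows = [{"id": i} for i in (5, 1, 9, 3, 7, 2, 8)]
print([r["id"] for r in load_in_chunks(rows)])
```

Each chunk touches only the rows past `last_id`, which is what avoids the repeated full scans (or growing OFFSETs) the quoted question worries about; in a real program `fetch_chunk` would be one indexed query per iteration.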
As an example, let's say you have a non-unique index on ID and are going to use the on-conflict-do-nothing route (I do believe this would be slow due to the need of having an index on the target table to support it, but let's assume it is ok). To do that I might just do a loop, starting with last_id = -1 (some id less than any other id), selecting a chunk of rows with id >= last_id ordered by id and inserting them. After doing that I might notice that, if the sort order is right, I do not need the index at all: drop the index and the on-conflict clause and just do, for every row, if (id > last_id) insert, before storing last_id = id.

Anyway, not knowing the real table characteristics and current usage patterns, I cannot recommend anything concrete.

> Would it be better to do it in one shot only, but by setting a higher value of some parameters like "maintenance_work_mem" or "max_parallel_workers"?

It depends on a lot of unknown (to us) things.

Francisco Olarte.
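The "drop the index" refinement above can be sketched in a few lines of Python: because the rows arrive sorted by id, keeping the first row per id needs only a single remembered value, no target-side index and no ON CONFLICT. The row shape here is made up for illustration; in a real program the rows would come from a server-side cursor:

```python
# Sketch of client-side dedup over a stream sorted by id: yield
# ("insert") only the first row seen for each id, tracking last_id
# instead of relying on a unique index plus ON CONFLICT DO NOTHING.

def dedup_sorted(rows):
    """Yield one row per id from rows already sorted by id."""
    last_id = None
    for row in rows:
        if last_id is None or row["id"] > last_id:
            yield row               # first occurrence: insert it
            last_id = row["id"]     # later rows with the same id are skipped

source = sorted(
    [{"id": i, "v": n} for n, i in enumerate((1, 2, 2, 3, 3, 3, 4))],
    key=lambda r: r["id"],
)
print([r["id"] for r in dedup_sorted(source)])
```

The trade-off is the one described above: this works only if the input really is sorted by id, but in exchange the ~2 billion duplicates never touch the target table at all.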