Lok:

On Sat, 3 Feb 2024 at 21:44, Lok P <loknath.73@xxxxxxxxx> wrote:
> On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte <folarte@xxxxxxxxxxxxxx> wrote:
> However, as we have ~5 billion rows in the base table, and out of that we were expecting almost half, i.e. ~2 billion, would be duplicates. And you said doing the inserts using the "order by CTID offset" approach must cause one full sequential scan of the whole table for loading each chunk/10M of rows, and that would take a long time, I believe.

I did not say "MUST CAUSE". In fact, I said I believe it would not. What I am going to say (now) is: test it. Make 1k, 10k, 100k, and 1M row tables in a scratch database, and EXPLAIN and test your queries there with and without indexes, etc. Not all of those are needed, but testing 100k and 1M in 1k batches could reveal hidden quadratic behaviour, and EXPLAIN would show unexpected sorts or scans.

> I am still trying to understand the other approach which you suggested. Not able to understand "you can select where index_col > last order by index_col limit 10M".
> However, to get the max ID value of the last 10M loaded rows in target, do you say that having a PK index created on that target table column (ID) will help, and we can save the max(ID) value subsequently in another table to fetch and keep loading from the source table (as ID > Max_ID stored in temp table)?

I am a programmer by trade. When faced with problems like these, unless I find a trivial solution, I tend to write auxiliary programs, as it is much easier for me to write a thousand lines of Perl (or Python/C/C++/Java), or a couple of hundred lines of plpgsql (if I want to avoid round trips), than to debug complicated SQL-only workflows. For your kind of problem I would make a program to read the rows and insert them.
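To make the "select where index_col > last" idea concrete, here is a minimal Python sketch of keyset pagination. An in-memory list stands in for the source table and a list comprehension stands in for the indexed query; names and the chunk size are illustrative, not from any real schema:

```python
# Keyset pagination sketch: fetch rows in chunks ordered by id,
# remembering the last id seen so each chunk resumes where the
# previous one ended. fetch_chunk() simulates
#   SELECT ... WHERE id > last ORDER BY id LIMIT chunk_size
# against an already-sorted row list.

def fetch_chunk(sorted_rows, last_id, chunk_size):
    """Simulate one keyset query against rows sorted by id."""
    return [r for r in sorted_rows if r["id"] > last_id][:chunk_size]

def load_in_chunks(source_rows, chunk_size=3):
    sorted_rows = sorted(source_rows, key=lambda r: r["id"])
    loaded = []
    last_id = -1  # smaller than any real id, so the first chunk starts at the top
    while True:
        chunk = fetch_chunk(sorted_rows, last_id, chunk_size)
        if not chunk:
            break
        loaded.extend(chunk)          # in SQL this would be the INSERT
        last_id = chunk[-1]["id"]     # resume point for the next chunk
    return loaded

rows = [{"id": i} for i in (5, 1, 9, 3, 7, 2, 8)]
print([r["id"] for r in load_in_chunks(rows)])
```

Each chunk touches only the rows past `last_id`, which is what avoids the repeated full scans (or growing OFFSETs) the quoted question worries about; in a real program `fetch_chunk` would be one indexed query per iteration.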
As an example, let's say you have a non-unique index on ID and are going to use the on-conflict-do-nothing route (I do believe this would be slow due to the need of having an index on the target table to support it, but let's assume it is ok). To do that I might just do a loop, starting with last_id = -1 (some id less than any other id), selecting a chunk of rows with id >= last_id ordered by id and inserting them. After doing that I might notice that, if the sort order is right, I do not need the index at all: drop the index and the on-conflict clause and just do, for every row, if (id > last_id) insert, before storing last_id = id.

Anyway, not knowing the real table characteristics and current usage patterns, I cannot recommend anything concrete.

> Would it be better to do it in one shot only, but by setting a higher value of some parameters like "maintenance_work_mem" or "max_parallel_workers"?

It depends on a lot of unknown (to us) things.

Francisco Olarte.
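The "drop the index" refinement above can be sketched in a few lines of Python: because the rows arrive sorted by id, keeping the first row per id needs only a single remembered value, no target-side index and no ON CONFLICT. The row shape here is made up for illustration; in a real program the rows would come from a server-side cursor:

```python
# Sketch of client-side dedup over a stream sorted by id: yield
# ("insert") only the first row seen for each id, tracking last_id
# instead of relying on a unique index plus ON CONFLICT DO NOTHING.

def dedup_sorted(rows):
    """Yield one row per id from rows already sorted by id."""
    last_id = None
    for row in rows:
        if last_id is None or row["id"] > last_id:
            yield row               # first occurrence: insert it
            last_id = row["id"]     # later rows with the same id are skipped

source = sorted(
    [{"id": i, "v": n} for n, i in enumerate((1, 2, 2, 3, 3, 3, 4))],
    key=lambda r: r["id"],
)
print([r["id"] for r in dedup_sorted(source)])
```

The trade-off is the one described above: this works only if the input really is sorted by id, but in exchange the ~2 billion duplicates never touch the target table at all.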