
Re: How to do faster DML

On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte <folarte@xxxxxxxxxxxxxx> wrote:
On Sat, 3 Feb 2024 at 19:29, Greg Sabino Mullane <htamfids@xxxxxxxxx> wrote:
...
> Given the size of your table, you probably want to divide that up.
> As long as nothing is changing the original table, you could do:
>
> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 offset 0;
> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 offset 10_000_000;
> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 offset 20_000_000;

Is it me or does this have the faint smell of quadratic behaviour? I
mean, you need to read and discard the first 10M to do offset 10M
(also, I believe ctid order is implied in a sequential scan, but I'm
not so sure; if not, it would need a full table sort on each pass).
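
As a rough illustration of that quadratic cost, with the same example
as above:

insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 0;        -- scans ~10M rows
insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 10000000;  -- scans ~20M rows
insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 20000000;  -- scans ~30M rows
-- so k batches of 10M end up touching on the order of k*(k+1)/2 * 10M rows in total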

When doing things like this, I normally have some kind of unique index
and do it by selecting with a limit above the last value read (stored
in the code when doing it that way, as I normally load first and index
later, so I cannot do a fast select max() on the target). Or, with
some kind of "sparse" index (meaning its groups are much smaller than
the batch size) and a little code, you can select where
index_col > last order by index_col limit 10M, remember the last
index_col received, and reselect discarding the missing rows (or just
reselect it when doing your on conflict do nothing, which also gives a
fast select max(id) thanks to the PK; it will work if the original has
an index on the id column even if it is not unique) to avoid that.
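
A minimal sketch of that keyset-style batching, assuming the source
has an indexed id column (names as in the example above, batch size
illustrative):

-- first batch; on later batches replace 0 with the max(id) already loaded
insert into mytable2
  select * from mytable1
  where id > 0
  order by id
  limit 10000000;

-- fast thanks to the PK/index on mytable2(id); becomes the next lower bound
select max(id) from mytable2;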

Also, I'm not sure whether ctid is ordered and whether you can select
where ctid > last in that order; if that works, it is probably faster
for immutable origins.
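
If that does work, a sketch of the ctid variant could look like this
(the literal ctid is just a placeholder for the last one seen in the
previous batch; PostgreSQL 14 and later can use a TID range scan for
such a condition, and the source must not be changing):

insert into mytable2
  select * from mytable1
  where ctid > '(500000,0)'   -- placeholder: last ctid from the previous batch
  order by ctid
  limit 10000000;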

Francisco Olarte.


Thank you. 

Yes, I think the "on conflict do nothing" option looks promising, as it will skip the duplicates on PK violation but let the load continue for the subsequent rows.
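
For reference, that is roughly the following (assuming mytable2 has a primary key or unique constraint on id):

insert into mytable2
  select * from mytable1
on conflict (id) do nothing;  -- rows whose id already exists are skipped, the rest keep loading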

However, we have ~5 billion rows in the base table, and we expect almost half of them, i.e. ~2 billion, to be duplicates. And as you said, doing the inserts using the "order by CTID offset" approach would mean one full sequential scan of the whole table for each 10M-row chunk, which I believe would take a long time.

I am still trying to understand the other approach you suggested; I cannot quite follow "you can select where index_col > last order by index_col limit 10M".
However, to get the max ID value of the last 10M rows loaded into the target, are you saying that having a PK index on that target table column (ID) will help, and that we can save the max(ID) value in another table and keep loading from the source table with ID > the max_ID stored in that table (roughly like the sketch after this question)?
OR
Would it be better to do it in one shot only, but with higher values for some parameters like "maintenance_work_mem" or "max_parallel_workers"?
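
To make the question concrete, a rough sketch of what I mean by the checkpoint-table idea (the load_progress table name and columns are made up; it assumes a unique index on mytable2(id)):

-- one-time setup (made-up checkpoint table)
create table if not exists load_progress (last_id bigint);
insert into load_progress values (0);

-- repeat until no rows are inserted
insert into mytable2
  select s.* from mytable1 s
  where s.id > (select last_id from load_progress)
  order by s.id
  limit 10000000
on conflict (id) do nothing;

update load_progress set last_id = (select max(id) from mytable2);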


