How to do faster DML

Lok P <loknath.73@xxxxxxxxx> · Sat, 3 Feb 2024 17:50:02 +0530

Hello All,
A non partitioned table having ~4.8 billion rows in it and having data size as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has got approx ~1billion+ duplicate rows inserted in it and we want to get the duplicate data removed for this table and create a PK/unique constraint back so as to not have the duplicate values in future. We are struggling to do the same.

Teammates suggested doing this using CTAS method, i.e. create a new table with the unique record set and then drop the main table. Something as below

create table TAB1_New
as
SELECT  * from TAB1 A
where ID in
      (select min(ID) from TAB1
      group by ID having count(ID)>=1 );

But for the above to work faster , they mentioned to have an index created on the column using which the duplicate check will be performed i.e ID column. So, creating the index itself took ~2hrs+ and the index size now shows as ~116GB.

Create index idx1 on TAB1(ID)

And now running the SELECT subquery part of the CTAS statement to see if its giving the correct unique records count. It ran for 2.5 hrs and then we killed it. Below is the plan for the same.

explain select  min(ID) from TAB1 A
        group by ID having count(ID)>=1

GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
  Group Key: ID
  Filter: (count(ID) >= 1)
  ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17 rows=4883397120 width=14)

I want to understand if by any way this can be done faster . Also I am worried that creating PK constraint/index back after deleting the duplicate is also going to run forever. Is there any way we can make these heavy operations faster on postgre by facilitating more database resources through some parameter setup, like parallel hint etc? We have pg_hint_plan extension added, but not seeing the parallel hint enforced when adding it to the query.

In Oracle we have Parallel hints, Direct path read/write for faster read/write operations, parallel index scan etc. available, if anything similar to that available in aurora postgre to facilitate more horsepower and speed up the batch operations. And , how can we monitor progress of any running query ?
 Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU, 256GB RAM. PG version 15.4.

Regards
Lok