Re: How to do faster DML

Lok P <loknath.73@xxxxxxxxx> · Sat, 3 Feb 2024 21:39:44 +0530

Ron Johnson <ronljohnsonjr@xxxxxxxxx> 7:37 PM (1 hour ago) 
to pgsql-general
On Sat, Feb 3, 2024 at 7:37 PM Ron Johnson <ronljohnsonjr@xxxxxxxxx> wrote:
On Sat, Feb 3, 2024 at 8:55 AM Lok P <loknath.73@xxxxxxxxx> wrote:
Apology. One correction, the query is like below. I. E filter will be on on ctid which I believe is equivalent of rowid in oracle and we will not need the index on Id column then. 

 But, it still runs long, so thinking any other way to make the duplicate removal faster? 

Also wondering , the index creation which took ~2.5hrs+ , would that have been made faster any possible way by allowing more db resource through some session level db parameter setting?
create table TAB1_New
as
SELECT  * from TAB1 A
where CTID in
      (select min(CTID) from TAB1
      group by ID having count(ID)>=1 );

On Sat, Feb 3, 2024 at 5:50 PM Lok P <loknath.73@xxxxxxxxx> wrote:
Hello All,
A non partitioned table having ~4.8 billion rows in it and having data size as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has got approx ~1billion+ duplicate rows inserted in it and we want to get the duplicate data removed for this table and create a PK/unique constraint back so as to not have the duplicate values in future. We are struggling to do the same.

Teammates suggested doing this using CTAS method, i.e. create a new table with the unique record set and then drop the main table. Something as below

create table TAB1_New
as
SELECT  * from TAB1 A
where ID in
      (select min(ID) from TAB1
      group by ID having count(ID)>=1 );

But for the above to work faster , they mentioned to have an index created on the column using which the duplicate check will be performed i.e ID column. So, creating the index itself took ~2hrs+ and the index size now shows as ~116GB.

Create index idx1 on TAB1(ID)

And now running the SELECT subquery part of the CTAS statement to see if its giving the correct unique records count. It ran for 2.5 hrs and then we killed it. Below is the plan for the same.

explain select  min(ID) from TAB1 A
        group by ID having count(ID)>=1

GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
  Group Key: ID
  Filter: (count(ID) >= 1)
  ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17 rows=4883397120 width=14)

I want to understand if by any way this can be done faster . Also I am worried that creating PK constraint/index back after deleting the duplicate is also going to run forever. Is there any way we can make these heavy operations faster on postgre by facilitating more database resources through some parameter setup, like parallel hint etc? We have pg_hint_plan extension added, but not seeing the parallel hint enforced when adding it to the query.

In Oracle we have Parallel hints, Direct path read/write for faster read/write operations, parallel index scan etc. available, if anything similar to that available in aurora postgre to facilitate more horsepower and speed up the batch operations. And , how can we monitor progress of any running query ?
 Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU, 256GB RAM. PG version 15.4.

 Aurora is not Postgresql, so configurations might not work.  Having said that...
https://www.postgresql.org/docs/15t/how-parallel-query-works.html

And have you analyzed the table lately?  Also, what's your work_mem and maintenance_work_mem?

 Thank you .

Below are the values of the default parameters in this instance
SHOW max_worker_processes; - 128
show max_parallel_workers_per_gather;- 4
show max_parallel_workers;- 32
show max_parallel_maintenance_workers; - 2
show maintenance_work_mem; - 4155MB
show work_mem; - 8MB
show shared_buffers ; -22029684

When I ran the CTAS queries and index creation process , I had not a very clear idea of how these are related to each other and help each of the operations,  but I set a few of those as below before triggering those in the same session. 

set max_parallel_workers_per_gather=16;
SET max_parallel_maintenance_workers TO 16;
SET maintenance_work_mem TO '16 GB';

The instance has a total ~256 GB memory, so how should I adjust/bump these values when running heavy SELECT queries doing a large sequential scan  OR large index creation process OR any Select query with heavy sorting/"order by" operations OR heavy JOINS?

I have not analyzed the table manually though , but seeing the auto_vaccum and auto_analyze column getting populated in the pg_stat_user_tables , I thought it must be doing that automatically. 

By the way if we run "analyze tab1' on this 1.5TB table , will that run longer and will any of the above parameters help to expedite that ANALYZE operation too, if I run the ANALYZE manually?

Regards
Lok