Hi Laurenz,
On Fri, Mar 27, 2020 at 15:46, Laurenz Albe (<laurenz.albe@xxxxxxxxxxx>) wrote:
> On Fri, 2020-03-27 at 15:13 +0100, Ekaterina Amez wrote:
>> I'm trying to clean up a database with millions of records of
>> useless-but-don't-remove-just-in-case data. [...]
>>
>> But I'm also cleaning tables with 150 million records where I'm going to
>> remove 60% of the existing data, and after a few tests I'm not sure what
>> the best approach is, as all of them seem to take a similar time to run.
>> These tables come in groups of four with a master, detail,
>> master_history, detail_history structure. None of the tables have a
>> primary key, foreign key, or any constraint except the sequence used for
>> what should be the PK column, though this column is not defined as a PK.
> You should define primary and foreign keys if you can, but I guess
> I don't have to tell you that.
I know about DB design ;)
This master-detail-master_hist-detail_hist structure is repeated all over the DB, and the other groups of tables are perfectly created with their PK-FK-UQ-IX... I don't know why these ones weren't created the same way.
Excuse me if this is a silly question, but I've read (or understood) that it's better to remove constraints to improve delete performance... does this apply to indexes only, or also to PKs and FKs?
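For what it's worth, my understanding is that two separate costs are involved (a sketch with hypothetical table, column, and index names): every index on the table you delete from must be maintained per deleted row, and a foreign key *pointing at* that table forces a lookup in the referencing table for each deleted row, which is slow unless the referencing column is indexed.

```sql
-- Hypothetical names, for illustration only.
-- If detail_hist has an FK referencing master_hist, deleting from
-- master_hist triggers a lookup per row; index the referencing column:
CREATE INDEX detail_hist_master_id_idx ON detail_hist (master_id);

-- Indexes on the table being purged add per-row maintenance cost,
-- so for a one-off bulk delete it can pay to drop and recreate them:
DROP INDEX IF EXISTS master_hist_log_date_idx;
DELETE FROM master_hist WHERE log_date < DATE '2015-01-01';
CREATE INDEX master_hist_log_date_idx ON master_hist (log_date);
```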
>> I've decided to delete from the last one in chunks (10 days of data per
>> chunk, but it could be any other quantity), so I've created a function.
>> I've tested it with indexes (in master_hist for filtering data and in
>> detail_hist for the FK and PK), without indexes, and after analyzing the
>> table, and no matter what I always end up with more or less the same
>> execution time. I can afford the time it's taking to run, but I'd like
>> to know if there's a better way to do this.
> There is no need to delete in batches unless you have a need to keep
> transactions short (danger of deadlock because the data are still
> modified, or you cannot afford to block autovacuum that long).
I prefer doing it in batches because I know there are other processes accessing this table, and I can't be sure they won't change any data.
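For reference, one chunk of such a batched delete might look like the statements below (a minimal sketch with hypothetical table and column names, using the 10-day window described above). Driving each chunk from a client script keeps every transaction short; a function or DO block would run all chunks inside a single transaction, which defeats that goal.

```sql
-- Hypothetical names. One 10-day chunk, deleting details first so
-- no orphaned detail rows are left if the purge is interrupted:
BEGIN;
DELETE FROM detail_hist
 WHERE master_id IN (SELECT id
                       FROM master_hist
                      WHERE log_date >= DATE '2010-01-01'
                        AND log_date <  DATE '2010-01-11');
DELETE FROM master_hist
 WHERE log_date >= DATE '2010-01-01'
   AND log_date <  DATE '2010-01-11';
COMMIT;
-- ...then advance the window by 10 days and repeat.
```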
> If you can drop the indexes while you do it (downtime), go for it.
> Perhaps there is a way to use partial indexes that exclude all the
> data that you have to delete, then work could go on as normal.
As I said, these particular tables don't have any indexes at all. I'll give the partial index suggestion a try, thanks.
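In case it helps anyone following along, a partial index along the lines Laurenz suggests (hypothetical names again) covers only the rows that will survive the purge, so queries from concurrent processes keep a usable index while the bulk delete never has to maintain entries for the rows being removed:

```sql
-- Hypothetical names: index only the rows that will remain after
-- the cleanup, assuming everything before 2015 is being deleted.
CREATE INDEX master_hist_recent_idx
    ON master_hist (log_date)
 WHERE log_date >= DATE '2015-01-01';
```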
>> I'm testing on version 9.2 BUT
>> the production server is 8.4 (legacy application, supposed to be on at
>> least 9.2, but I recently discovered it was 8.4; an upgrade is planned,
>> but not now). Config parameters are the default ones.
> Now that is a seriously bad idea. You should test on the same version
> as you have running in production. And you should insist on an upgrade.
> People who insist on running ancient software often insist on ancient
> hardware as well, and both are a good way to get data corruption.
> If the system blows up, they are going to blame you.
Believe me, I'm totally aware of all of this. The upgrade is planned to happen after I clean up the database. I'm the one who discovered that the production server is so old; it looked like no one knew it before. In the time I've been working here I've upgraded two servers.
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
Regards,
Ekaterina