Craig James wrote:
> Tyrrill, Ed wrote:
>
>> I have a table, let's call it A, whose primary key, a_id, is referenced
>> in a second table, let's call it B. For each unique A.a_id there are
>> generally many rows in B with the same a_id. My problem is that I want
>> to delete a row in A when the last row in B that references it is
>> deleted. Right now I just query for rows in A that aren't referenced by
>> B, and that worked great when the tables were small, but it takes over
>> an hour now that the tables have grown larger (over 200 million rows in
>> B and 14 million in A). The delete has to do a sequential scan of both
>> tables since I'm looking for what's not in the indexes.
>>
>> I was going to try creating a trigger after delete on B for each row to
>> check for more rows in B with the same a_id, and delete the row in A if
>> none found. In general I will be deleting 10's of millions of rows from
>> B and 100's of thousands of rows from A on a daily basis. What do you
>> think? Does anyone have any other suggestions on different ways to
>> approach this?
>
> Essentially what you're doing is taking the one-hour job and spreading it
> out in little chunks over thousands of queries. If you have 10^7 rows
> in B and 10^5 rows in A, then on average you have 100 references from B
> to A. That means that 99% of the time, your trigger will scan B and find
> that there's nothing to do. This could add a lot of overhead to your
> ordinary transactions, costing a lot more in the long run than just doing
> the once-a-day big cleanout.
>
> You didn't send the specifics of the query you're using, along with an
> EXPLAIN ANALYZE of it in operation. It may also be that your SQL is not
> optimal, and that somebody could suggest a more efficient query.
>
> It's also possible that it's not the sequential scans that are the
> problem, but rather that it just takes a long time to delete 100,000
> rows from table A because you have a lot of indexes. Or it could be
> a combination of performance problems.
>
> You haven't given us enough information to really analyze your problem.
> Send more details!
>
> Craig

Ok. Yes, there are a bunch of indexes on A that may slow down the delete,
but if I just run the select part of the delete statement through explain
analyze, that accounts for the majority of the time. The complete sql
statement for the delete is:

delete from backupobjects
where record_id in
    (select backupobjects.record_id
       from backupobjects
       left outer join backup_location using (record_id)
      where backup_location.record_id is null)

What I've referred to as A is backupobjects, and B is backup_location.
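The trigger I described above would look roughly like the following untested
sketch (the function and trigger names here are just placeholders):

create or replace function delete_orphan_backupobject() returns trigger as $$
begin
    -- After a row is removed from backup_location, check whether any rows
    -- still reference the same record_id; if none remain, delete the
    -- corresponding parent row from backupobjects.
    if not exists (select 1
                     from backup_location
                    where record_id = old.record_id) then
        delete from backupobjects where record_id = old.record_id;
    end if;
    return old;
end;
$$ language plpgsql;

create trigger backup_location_cleanup
    after delete on backup_location
    for each row execute procedure delete_orphan_backupobject();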
Here is explain analyze of just the select:

mdsdb=# explain analyze select backupobjects.record_id from backupobjects left outer join backup_location using(record_id) where backup_location.record_id is null;

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=38725295.93..42505394.70 rows=13799645 width=8) (actual time=6503583.342..8220629.311 rows=93524 loops=1)
   Merge Cond: ("outer".record_id = "inner".record_id)
   Filter: ("inner".record_id IS NULL)
   ->  Index Scan using backupobjects_pkey on backupobjects  (cost=0.00..521525.10 rows=13799645 width=8) (actual time=15.955..357813.621 rows=13799645 loops=1)
   ->  Sort  (cost=38725295.93..39262641.69 rows=214938304 width=8) (actual time=6503265.293..7713657.750 rows=214938308 loops=1)
         Sort Key: backup_location.record_id
         ->  Seq Scan on backup_location  (cost=0.00..3311212.04 rows=214938304 width=8) (actual time=11.175..1881179.825 rows=214938308 loops=1)
 Total runtime: 8229178.269 ms
(8 rows)

I ran vacuum analyze after the last time any inserts, deletes, or updates
were done, and before I ran the query above. I've attached my
postgresql.conf. The machine has 4 GB of RAM.

Thanks,
Ed
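As an aside, the same orphan check can also be written as an anti-join with
NOT EXISTS; whether the planner produces a better plan for it on this data
would need to be verified with EXPLAIN ANALYZE:

delete from backupobjects
where not exists (select 1
                    from backup_location
                   where backup_location.record_id = backupobjects.record_id);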
Attachment: postgresql.conf