Scrub one large table against another

Brendan Curran <brendan.curran@xxxxxxxxx> · Tue, 10 Oct 2006 15:44:23 -0600

I currently have a db supporting what is for the most part an OLAP data 
warehousing application.

One table (good data) has roughly 120 million rows, divided into roughly 
40 different relational groups (logically by foreign key). Every time I 
add data to this table, I need to afterwards scrub that group against 
known "bad data" which is housed in a second table that has roughly 21 
million rows.

The 120 million row good data table is called "email_record"
The 21 million row bad data table is called "suppress"

There are separate btree indexes on 'email_record_id', 'email_list_id' 
and 'email' on both tables.

Each time I scrub data I pull out anywhere from 1 to 5 million rows from 
the good table (depending on the size of the group i'm scrubbing) and 
compare them against the 21 million rows in the 'suppress' table.

So far I've done this using a temporary staging table that stores only 
the email_record_id for each row from the relevant group of the good 
table. I use a plsql function that does roughly the following (i've 
included only sql syntax and inserted the constant '9' where i would 
normally use a variable):

The characters: email_record_id int8, email varchar(255), email_list_id int8
-------------------------------------------------------------

CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
	SELECT email_record_id from ONLY email_record er
	WHERE email_list_id = 9 AND email IN
	(select email from suppress);

CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );

INSERT INTO er_banned
SELECT * from ONLY email_record er WHERE EXISTS
(SELECT 1 from temp_list_suppress ts where er.email_record_id = 
ts.email_record_id)';

DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
	(SELECT email_record_id from temp_list_suppress);

TRUNCATE TABLE temp_list_suppress;
DROP TABLE temp_list_suppress;
--------------------------------------------------------------

The performance is dreadful, is there a more efficient way to do this? 
Would I be better off just grabbing * initially from the good table 
instead of just the id to avoid more sequential searches later? Here are 
my configs:

Debian
Postgres 8.1.4
dual zeon
ram: 4 gigs
raid 5

# - Memory -
shared_buffers = 3000
work_mem = 92768
maintenance_work_mem = 128384

autovacuum is turned off, and the db is annalyzed and vacuumed regularly.

Regards,
Brendan