Scrub one large table against another

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I currently have a db supporting what is for the most part an OLAP data warehousing application.

One table (good data) has roughly 120 million rows, divided into roughly 40 different relational groups (logically by foreign key). Every time I add data to this table, I need to afterwards scrub that group against known "bad data" which is housed in a second table that has roughly 21 million rows.

The 120 million row good data table is called "email_record"
The 21 million row bad data table is called "suppress"

There are separate btree indexes on 'email_record_id', 'email_list_id' and 'email' on both tables.

Each time I scrub data I pull out anywhere from 1 to 5 million rows from the good table (depending on the size of the group i'm scrubbing) and compare them against the 21 million rows in the 'suppress' table.

So far I've done this using a temporary staging table that stores only the email_record_id for each row from the relevant group of the good table. I use a plsql function that does roughly the following (i've included only sql syntax and inserted the constant '9' where i would normally use a variable):

The characters: email_record_id int8, email varchar(255), email_list_id int8
-------------------------------------------------------------

CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
	SELECT email_record_id from ONLY email_record er
	WHERE email_list_id = 9 AND email IN
	(select email from suppress);

CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );

INSERT INTO er_banned
SELECT * from ONLY email_record er WHERE EXISTS
(SELECT 1 from temp_list_suppress ts where er.email_record_id = ts.email_record_id)';

DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
	(SELECT email_record_id from temp_list_suppress);

TRUNCATE TABLE temp_list_suppress;
DROP TABLE temp_list_suppress;
--------------------------------------------------------------

The performance is dreadful, is there a more efficient way to do this? Would I be better off just grabbing * initially from the good table instead of just the id to avoid more sequential searches later? Here are my configs:

Debian
Postgres 8.1.4
dual zeon
ram: 4 gigs
raid 5

# - Memory -
shared_buffers = 3000
work_mem = 92768
maintenance_work_mem = 128384

autovacuum is turned off, and the db is annalyzed and vacuumed regularly.


Regards,
Brendan


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux