Re: Scrub one large table against another

Brendan Curran <brendan.curran@xxxxxxxxx> · Tue, 10 Oct 2006 16:37:00 -0600

Tom Lane wrote:
Brendan Curran <brendan.curran@xxxxxxxxx> writes:
CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
	SELECT email_record_id from ONLY email_record er
	WHERE email_list_id = 9 AND email IN
	(select email from suppress);

CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );

INSERT INTO er_banned
SELECT * from ONLY email_record er WHERE EXISTS
(SELECT 1 from temp_list_suppress ts where er.email_record_id = 
ts.email_record_id)';

DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
	(SELECT email_record_id from temp_list_suppress);

TRUNCATE TABLE temp_list_suppress;
DROP TABLE temp_list_suppress;

The performance is dreadful, is there a more efficient way to do this? 

Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?
If you don't even know which part is slow, it's hard to improve.

FIRST INSERT (Just the select is explained):
Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=32)
  Hash Cond: (("outer".email)::text = ("inner".email)::text)
  ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25)
        ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25)
              Sort Key: suppress.email
              ->  Seq Scan on suppress  (cost=0.00..393024.60 
rows=21646260 width=25)
  ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32)
        ->  Bitmap Heap Scan on email_record er 
(cost=38464.83..3899868.47 rows=4606808 width=32)
              Recheck Cond: (email_list_id = 13)
              ->  Bitmap Index Scan on list  (cost=0.00..38464.83 
rows=4606808 width=0)
                    Index Cond: (email_list_id = 13)

SECOND INSERT (Using EXISTS):
Seq Scan on email_record er  (cost=0.00..381554175.29 rows=62254164 
width=1863)
  Filter: (subplan)
  SubPlan
    ->  Index Scan using er_primeq_pk on er_primeq eq  (cost=0.00..3.03 
rows=1 width=0)
          Index Cond: ($0 = email_record_id)

SECOND INSERT (Using IN):
Nested Loop  (cost=26545.94..2627497.28 rows=27134 width=1863)
  ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
        ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 
width=8)
  ->  Index Scan using email_record_pkey on email_record er 
(cost=0.00..3.52 rows=1 width=1863)
        Index Cond: (er.email_record_id = "outer".email_record_id)
        Filter: (email_list_id = 13)

DELETE
Nested Loop  (cost=26545.94..2627497.28 rows=50846 width=6)
  ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
        ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 
width=8)
  ->  Index Scan using email_record_pkey on email_record 
(cost=0.00..3.52 rows=1 width=14)
        Index Cond: (email_record.email_record_id = 
"outer".email_record_id)
        Filter: (email_list_id = 9)

To get this explain data I used a sample "temp_suppress" table that 
contained about 700k rows and was indexed but not analyzed...

It would probably help to do an "ANALYZE temp_list_suppress" right after
populating the temp table.  As you have it, the second insert and delete
are being planned with nothing more than a row count (obtained during
CREATE INDEX) and no stats about distribution of the table contents.

Also, I'd be inclined to try replacing the EXISTS with an IN test;
in recent PG versions the planner is generally smarter about IN.
(Is there a reason why you are doing the INSERT one way and the
DELETE the other?)

BTW, that TRUNCATE right before the DROP seems quite useless,
although it's not the main source of your problem.

			regards, tom lane