Re: Scrub one large table against another

Tom Lane wrote:
Brendan Curran <brendan.curran@xxxxxxxxx> writes:
So much time is being spent in the Unique and Sort leaves... I would think that it wouldn't need to do the unique portion, since there is no DISTINCT clause...

There's nothing in that query suggesting that suppress.email is unique.
If you know that it is, try using a plain join instead of an IN.

			regards, tom lane



Interestingly (and thank you, Tom and Jim), the explicit JOIN improved performance tremendously (results below). I converted the entire query to use explicit joins instead of IN and EXISTS and now get acceptable performance. The next step from here is probably RAID1/RAID10 and possibly partitioning my large table (welcome to DDL insanity, right?).

I have to add that I'm a little surprised the documentation is so generous toward IN and EXISTS. Is there something amiss in my configuration that prevents them from performing well? If not, I can't imagine a case where IN or EXISTS would outperform an explicit JOIN...
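
For reference, the EXISTS form of the same check would look roughly like this (a sketch for illustration only, not the exact query I ran; the JOIN and IN versions I actually timed are at the bottom of this mail):

EXPLAIN ANALYZE SELECT email_record_id FROM ONLY email_record er
    WHERE er.email_list_id = 13 AND EXISTS
    (SELECT 1 FROM suppress s WHERE s.email = er.email);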

Additionally, I manually scrub for duplicates at the group level in the email_record table to keep my records unique. I would like to use a unique constraint, but I have found that batching in JDBC becomes impossible due to irrecoverable errors, even when using BEFORE INSERT triggers that simply return NULL if the record already exists. Has anyone got an elegant solution to the 'insert only if it doesn't already exist' problem, similar to MSSQL's MERGE command?
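
The closest set-based workaround I can think of is an INSERT ... SELECT that filters out rows that already exist, something like the sketch below (the staging table and the two-column key are placeholders, not my real schema, and without a unique constraint or locking this is still not safe against concurrent inserts):

-- Sketch only: "staging" and the (email_list_id, email) key are placeholders.
INSERT INTO email_record (email_list_id, email)
    SELECT s.email_list_id, s.email
    FROM staging s
    WHERE NOT EXISTS
        (SELECT 1 FROM email_record er
         WHERE er.email_list_id = s.email_list_id
           AND er.email = s.email);

It works, but it feels clumsy compared to a real MERGE, hence the question.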

Just one more thing... I have found that maintaining a btree index on a varchar(255) column is extremely expensive on insert/update/delete. Unfortunately that index is necessary for my queries and reports, so I am transitioning to importing data into an unindexed staging table and then merging it into the larger table. All the docs and posts recommend dropping the index, importing the data, and then recreating the index, which is untenable on a daily / bi-weekly basis. Is there a more elegant solution to this indexing problem?
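
For what it's worth, the staging flow I'm moving toward looks roughly like this (a sketch only; email_record_stage and the file path are placeholders):

-- email_record_stage is an unindexed table with the same columns as
-- email_record; the COPY load therefore touches no indexes at all.
COPY email_record_stage FROM '/path/to/daily_import.dat';

-- Merge into the big table in one set-based statement, so the
-- expensive btree on email is only maintained during this one pass
-- rather than across many small JDBC batches.
INSERT INTO email_record
    SELECT * FROM email_record_stage;

TRUNCATE email_record_stage;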

Thank you for all of your help!

EXPLAIN ANALYZE result comparison...

1. EXPLAIN ANALYZE SELECT email_record_id from ONLY email_record er
    WHERE email_list_id = 13 AND email IN
    (select email from suppress);

Hash Join (cost=8359220.68..9129843.00 rows=800912 width=8) (actual time=2121601.603..2121601.603 rows=0 loops=1)
  Hash Cond: (("outer".email)::text = ("inner".email)::text)
  ->  Unique (cost=4414093.19..4522324.49 rows=21646260 width=25) (actual time=1165955.907..1434439.731 rows=21646261 loops=1)
        ->  Sort (cost=4414093.19..4468208.84 rows=21646260 width=25) (actual time=1165955.903..1384667.715 rows=21646261 loops=1)
              Sort Key: suppress.email
              ->  Seq Scan on suppress (cost=0.00..393024.60 rows=21646260 width=25) (actual time=37.784..609848.551 rows=21646261 loops=1)
  ->  Hash (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=554522.983..554522.983 rows=3245336 loops=1)
        ->  Bitmap Heap Scan on email_record er (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=275640.435..541342.727 rows=3245336 loops=1)
              Recheck Cond: (email_list_id = 13)
              ->  Bitmap Index Scan on list (cost=0.00..38464.83 rows=4606808 width=0) (actual time=275102.037..275102.037 rows=5172979 loops=1)
                    Index Cond: (email_list_id = 13)
Total runtime: 2,122,693.864 ms
--------------------------------------------------------

2. EXPLAIN ANALYZE SELECT email_record_id FROM ONLY email_record er JOIN suppress s USING
(email) WHERE er.email_list_id = 13;

Hash Join (cost=3945127.49..5000543.11 rows=800912 width=8) (actual time=808874.088..808874.088 rows=0 loops=1)
  Hash Cond: (("outer".email)::text = ("inner".email)::text)
  ->  Seq Scan on suppress s (cost=0.00..393024.60 rows=21646260 width=25) (actual time=661.518..216933.399 rows=21646261 loops=1)
  ->  Hash (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=494294.932..494294.932 rows=3245336 loops=1)
        ->  Bitmap Heap Scan on email_record er (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=242198.226..485942.542 rows=3245336 loops=1)
              Recheck Cond: (email_list_id = 13)
              ->  Bitmap Index Scan on list (cost=0.00..38464.83 rows=4606808 width=0) (actual time=241769.786..241769.786 rows=5172979 loops=1)
                    Index Cond: (email_list_id = 13)
Total runtime: 808,884.387 ms



