Re: Scrub one large table against another

Tom Lane wrote:
Brendan Curran <brendan.curran@xxxxxxxxx> writes:
So much time is being spent in the Unique and Sort leaves... I would think that it wouldn't need to do the unique portion, since there is no DISTINCT clause...

There's nothing in that query suggesting that suppress.email is unique.
If you know that it is, try using a plain join instead of an IN.

			regards, tom lane



Interestingly (and thank you, Tom and Jim), the explicit JOIN improved performance tremendously (results below). I converted the entire query to use explicit joins instead of IN and EXISTS and now get acceptable performance. The next step from here is probably RAID1/RAID10 and possibly partitioning my large table (welcome to DDL insanity, right?).

I have to add that I'm a little surprised the documentation is so generous toward IN and EXISTS. Is there something amiss in my configuration that prevents them from performing well? If not, I can't imagine a case where IN or EXISTS would outperform an explicit JOIN...
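
For reference, the EXISTS form of the same check would look roughly like this (a sketch for illustration only, not the exact query I ran; the JOIN and IN versions I actually timed are at the bottom of this mail):

EXPLAIN ANALYZE SELECT email_record_id FROM ONLY email_record er
    WHERE er.email_list_id = 13 AND EXISTS
    (SELECT 1 FROM suppress s WHERE s.email = er.email);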

Additionally, I manually scrub for duplicates at the group level in the email_record table to keep my records unique. I would like to use a unique constraint, but I have found that batching in JDBC becomes impossible due to irrecoverable errors, even when using BEFORE INSERT triggers that simply return NULL if the record already exists. Has anyone got an elegant solution to the 'insert only if it doesn't already exist' problem, similar to MSSQL's MERGE command?
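
The closest set-based workaround I can think of is an INSERT ... SELECT that filters out rows that already exist, something like the sketch below (the staging table and the two-column key are placeholders, not my real schema, and without a unique constraint or locking this is still not safe against concurrent inserts):

-- Sketch only: "staging" and the (email_list_id, email) key are placeholders.
INSERT INTO email_record (email_list_id, email)
    SELECT s.email_list_id, s.email
    FROM staging s
    WHERE NOT EXISTS
        (SELECT 1 FROM email_record er
         WHERE er.email_list_id = s.email_list_id
           AND er.email = s.email);

It works, but it feels clumsy compared to a real MERGE, hence the question.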

Just one more thing... I have found that maintaining a btree index on a varchar(255) column is extremely expensive on insert/update/delete. Unfortunately that index is necessary for my queries and reports, so I am transitioning to importing data into an unindexed staging table and then merging it into the larger table. All the docs and posts recommend dropping the index, importing the data, and then recreating the index, which is untenable on a daily / bi-weekly basis. Is there a more elegant solution to this indexing problem?
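
For what it's worth, the staging flow I'm moving toward looks roughly like this (a sketch only; email_record_stage and the file path are placeholders):

-- email_record_stage is an unindexed table with the same columns as
-- email_record; the COPY load therefore touches no indexes at all.
COPY email_record_stage FROM '/path/to/daily_import.dat';

-- Merge into the big table in one set-based statement, so the
-- expensive btree on email is only maintained during this one pass
-- rather than across many small JDBC batches.
INSERT INTO email_record
    SELECT * FROM email_record_stage;

TRUNCATE email_record_stage;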

Thank you for all of your help!

EXPLAIN ANALYZE result comparison...

1. EXPLAIN ANALYZE SELECT email_record_id from ONLY email_record er
    WHERE email_list_id = 13 AND email IN
    (select email from suppress);

Hash Join (cost=8359220.68..9129843.00 rows=800912 width=8) (actual time=2121601.603..2121601.603 rows=0 loops=1)
  Hash Cond: (("outer".email)::text = ("inner".email)::text)
  ->  Unique (cost=4414093.19..4522324.49 rows=21646260 width=25) (actual time=1165955.907..1434439.731 rows=21646261 loops=1)
        ->  Sort (cost=4414093.19..4468208.84 rows=21646260 width=25) (actual time=1165955.903..1384667.715 rows=21646261 loops=1)
              Sort Key: suppress.email
              ->  Seq Scan on suppress (cost=0.00..393024.60 rows=21646260 width=25) (actual time=37.784..609848.551 rows=21646261 loops=1)
  ->  Hash (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=554522.983..554522.983 rows=3245336 loops=1)
        ->  Bitmap Heap Scan on email_record er (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=275640.435..541342.727 rows=3245336 loops=1)
              Recheck Cond: (email_list_id = 13)
              ->  Bitmap Index Scan on list (cost=0.00..38464.83 rows=4606808 width=0) (actual time=275102.037..275102.037 rows=5172979 loops=1)
                    Index Cond: (email_list_id = 13)
Total runtime: 2,122,693.864 ms
--------------------------------------------------------

2. EXPLAIN ANALYZE SELECT email_record_id FROM ONLY email_record er JOIN suppress s USING
(email) WHERE er.email_list_id = 13;

Hash Join (cost=3945127.49..5000543.11 rows=800912 width=8) (actual time=808874.088..808874.088 rows=0 loops=1)
  Hash Cond: (("outer".email)::text = ("inner".email)::text)
  ->  Seq Scan on suppress s (cost=0.00..393024.60 rows=21646260 width=25) (actual time=661.518..216933.399 rows=21646261 loops=1)
  ->  Hash (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=494294.932..494294.932 rows=3245336 loops=1)
        ->  Bitmap Heap Scan on email_record er (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=242198.226..485942.542 rows=3245336 loops=1)
              Recheck Cond: (email_list_id = 13)
              ->  Bitmap Index Scan on list (cost=0.00..38464.83 rows=4606808 width=0) (actual time=241769.786..241769.786 rows=5172979 loops=1)
                    Index Cond: (email_list_id = 13)
Total runtime: 808,884.387 ms



