Re: Scrub one large table against another

"Jim C. Nasby" <jim@xxxxxxxxx> · Wed, 11 Oct 2006 15:52:52 -0500

On Wed, Oct 11, 2006 at 10:53:41AM -0600, Brendan Curran wrote:
> Interestingly, and thank you to Tom and Jim, the explicit JOIN improved 
> performance tremendously (RESULTS BELOW). I converted the entire query 
> to use explicit joins instead of IN and EXISTS and discovered acceptable 
> performance. I think the next place to go from here is RAID1/RAID10 and 
> possibly partitioning my large table (Welcome to DDL insanity, right?).

Remember that partitioning is not a magic bullet: it only helps in cases
where you need to keep a lot of data, but normally only access a small
portion of it.

WAL on RAID5 without a really good controller will probably kill you.
Data being there isn't too much better. You'll probably be better with
either 1 raid 10 or 2 raid 1s.

> I have to add that I'm a little surprised the documentation is so 
> generous to IN and EXISTS. Is there something amiss in my configuration 
> that prevents them from performing correctly? If not, I can't imagine a 
> time when IN or EXISTS would be more performant than an explicit JOIN...

Well, IN != EXISTS != JOIN. Exists just stops as soon as it finds a
record. For some cases, it's equivalent to IN, but not all. IN has to
de-duplicate it's list in some fashion. For small IN lists, you can do
this with an OR, but at some point you need to switch to an actual
unique (actually, I suspect the difference in PostgreSQL just depends on
if you passed values into IN or a subquery). A join on the other hand
doesn't worry about duplicates at all. There may be some brains in the
planner that realize if a subquery will return a unique set (ie: you're
querying on a primary key).

> Additionally, I manually scrub for duplicates at the group level in the 
> email_record table to keep my records unique. I would like to use a 
> unique constraint, but have found that batching in JDBC is impossible 
> due to irrecoverable errors even when using BEFORE INSERT triggers to 
> just return NULL if a record exists already. Has anyone got an elegant 
> solution for the 'add only if not exists already' problem similar to 
> MSSQL's MERGE command?

Your best bet (until we have something akin to MERGE, hopefully in 8.3)
is to load the data into a TEMP table and de-dupe it from there.
Depending on what you're doing you might want to delete it, or update an
ID column in the temp table. Note that assumes that only one process is
loading data at any time, if that's not the case you have to get
trickier.

> Just one more thing... I have found that maintaining a btree index on a 
> varchar(255) value is extremely expensive on insert/update/delete. It is 
> unfortunately necessary for me to maintain this index for queries and 
> reports so I am transitioning to using an unindexed staging table to 
> import data into before merging it with the larger table. All the docs 
> and posts recommend is to drop the index, import your data, and then 
> create the index again. This is untenable on a daily / bi-weekly basis. 
> Is there a more elegant solution to this indexing problem?

You might be happier with tsearch than a regular index.
-- 
Jim Nasby                                            jim@xxxxxxxxx
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)