Re: Scrub one large table against another

Brendan Curran <brendan.curran@xxxxxxxxx> · Thu, 12 Oct 2006 12:05:04 -0600

Well, IN != EXISTS != JOIN. Exists just stops as soon as it finds a
record. For some cases, it's equivalent to IN, but not all. IN has to
de-duplicate it's list in some fashion. For small IN lists, you can do
this with an OR, but at some point you need to switch to an actual
unique (actually, I suspect the difference in PostgreSQL just depends on
if you passed values into IN or a subquery). A join on the other hand
doesn't worry about duplicates at all. There may be some brains in the
planner that realize if a subquery will return a unique set (ie: you're
querying on a primary key).

I agree, and it makes sense now that I consider it that IN would force the planner to implement some 
form of unique check - possibly leveraging a PK or unique index if one is already available. Maybe 
I'll tack up a note to the online documentation letting people know so that it's a little more 
explicitly clear that when you choose IN on data that isn't explicitly unique (to the planner i.e. 
post-analyze) you get the baggage of a forced unique whether you need it or not. Or perhaps someone 
that knows the internals of the planner a little better than me should put some info up regarding that?

Just one more thing... I have found that maintaining a btree index on a 
varchar(255) value is extremely expensive on insert/update/delete. It is 
unfortunately necessary for me to maintain this index for queries and 
reports so I am transitioning to using an unindexed staging table to 
import data into before merging it with the larger table. All the docs 
and posts recommend is to drop the index, import your data, and then 
create the index again. This is untenable on a daily / bi-weekly basis. 
Is there a more elegant solution to this indexing problem?

You might be happier with tsearch than a regular index.

Thanks, I'll look into using tsearch2 as a possibility. From what I've seen so far it would add 
quite a bit of complexity (necessary updates after inserts, proprietary query syntax that might 
require a large amount of specialization from client apps) but in the end the overhead may be less 
than that of maintaining the btree.

Thanks and Regards,
B