Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

Scott Carey <scott@xxxxxxxxxxxxxxxxx> · Wed, 21 Oct 2009 11:00:37 -0700

On 10/21/09 4:52 AM, "Shaul Dar" <shauldar@xxxxxxxxx> wrote:

> Tom,
> 
> 1. Actually I just tested you suggestion
> 
> SELECT COUNT (*) FROM T1 where NOT EXISTS
> (SELECT 1 FROM T2 where T1.PK <http://t1.pk/>  = T2.FK <http://t2.fk/> )
> 
> and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it
> took 1m8s,
> 
> My suggestion, i.e.
> 
> SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK <http://t1.pk/>  = T2.FK
> <http://t2.fk/> 
> WHERE T2.FK <http://t2.fk/>  IS NULL
> 
> was about twice as fast, 37s. (both returned same number of rows, about 2/3 of
> T1)
> 
> However I can use DELETE with your version (instead of "SELECT COUNT (*)"
> above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks!
> 
> 2. BTW. I presented my question earlier in an overly simplified fashion.
> Sorry. In actuality the two tables are joined on two columns,
> say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb =
> T2.FKb. So the IN versions suggested will not work
> since AFAIK IN only works for a single value.

The performance will stink in many cases, but IN and NOT IN can work on
multiple values, for example:

WHERE (a.key1, a.key2) NOT IN (SELECT b.key1, b.key2 FROM b).

The fastest (in 8.4) is definitely NOT EXISTS.

WHERE NOT EXISTS (SELECT 1 FROM b WHERE (b.key1, b.key2) = (a.key1, a.key2))

  I've done this, deleting from tables with 15M + rows where I need a "not
in" on two or three columns on multiple other tables.
However, NOT EXISTS is only fast if every NOT EXISTS clause is a select on
one table, if it is multiple tables and a join, things can get ugly and the
planner might not optimize it right.  In that case use two NOT EXISTS
clauses.  Always look at the EXPLAIN plan.

With 8.4 -- for performance generally prefer the following:
* prefer JOIN and implicit joins to IN and EXISTS.
* prefer 'NOT EXISTS' to 'NOT IN' or  'LEFT JOIN where (right is null)'

> 
> -- Shaul
> 
> On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> Shaul Dar <shauldar@xxxxxxxxx> writes:
>>> I assume this will work but will take a long time:
>> 
>>> DELETE * FROM T1 where T1.PK <http://T1.PK>  NOT IN
>>> (SELECT T1.PK <http://T1.PK>  FROM T1, T2 where T1.PK <http://T1.PK>  =
>>> T2.FK <http://T2.FK> )
>> 
>> Well, yeah, but it's unnecessarily inefficient --- why not just
>> 
>> DELETE FROM T1 where T1.PK <http://T1.PK>  NOT IN
>> (SELECT T2.FK <http://T2.FK>  FROM T2)
>> 
>> However, that still won't be tremendously fast unless the subselect fits
>> in work_mem.  As of 8.4 this variant should be reasonable:
>> 
>> DELETE FROM T1 where NOT EXISTS
>> (SELECT 1 FROM T2 where T1.PK <http://T1.PK>  = T2.FK <http://T2.FK> )
>> 
>> Pre-8.4 you should resort to the "left join where is null" trick,
>> but there's no need to be so obscure as of 8.4.
>> 
>>                         regards, tom lane
> 
> 

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance