On 10/21/09 4:52 AM, "Shaul Dar" <shauldar@xxxxxxxxx> wrote: > Tom, > > 1. Actually I just tested you suggestion > > SELECT COUNT (*) FROM T1 where NOT EXISTS > (SELECT 1 FROM T2 where T1.PK <http://t1.pk/> = T2.FK <http://t2.fk/> ) > > and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it > took 1m8s, > > My suggestion, i.e. > > SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK <http://t1.pk/> = T2.FK > <http://t2.fk/> > WHERE T2.FK <http://t2.fk/> IS NULL > > was about twice as fast, 37s. (both returned same number of rows, about 2/3 of > T1) > > However I can use DELETE with your version (instead of "SELECT COUNT (*)" > above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks! > > 2. BTW. I presented my question earlier in an overly simplified fashion. > Sorry. In actuality the two tables are joined on two columns, > say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb = > T2.FKb. So the IN versions suggested will not work > since AFAIK IN only works for a single value. The performance will stink in many cases, but IN and NOT IN can work on multiple values, for example: WHERE (a.key1, a.key2) NOT IN (SELECT b.key1, b.key2 FROM b). The fastest (in 8.4) is definitely NOT EXISTS. WHERE NOT EXISTS (SELECT 1 FROM b WHERE (b.key1, b.key2) = (a.key1, a.key2)) I've done this, deleting from tables with 15M + rows where I need a "not in" on two or three columns on multiple other tables. However, NOT EXISTS is only fast if every NOT EXISTS clause is a select on one table, if it is multiple tables and a join, things can get ugly and the planner might not optimize it right. In that case use two NOT EXISTS clauses. Always look at the EXPLAIN plan. With 8.4 -- for performance generally prefer the following: * prefer JOIN and implicit joins to IN and EXISTS. * prefer 'NOT EXISTS' to 'NOT IN' or 'LEFT JOIN where (right is null)' > > -- Shaul > > On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote: >> Shaul Dar <shauldar@xxxxxxxxx> writes: >>> I assume this will work but will take a long time: >> >>> DELETE * FROM T1 where T1.PK <http://T1.PK> NOT IN >>> (SELECT T1.PK <http://T1.PK> FROM T1, T2 where T1.PK <http://T1.PK> = >>> T2.FK <http://T2.FK> ) >> >> Well, yeah, but it's unnecessarily inefficient --- why not just >> >> DELETE FROM T1 where T1.PK <http://T1.PK> NOT IN >> (SELECT T2.FK <http://T2.FK> FROM T2) >> >> However, that still won't be tremendously fast unless the subselect fits >> in work_mem. As of 8.4 this variant should be reasonable: >> >> DELETE FROM T1 where NOT EXISTS >> (SELECT 1 FROM T2 where T1.PK <http://T1.PK> = T2.FK <http://T2.FK> ) >> >> Pre-8.4 you should resort to the "left join where is null" trick, >> but there's no need to be so obscure as of 8.4. >> >> regards, tom lane > > -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance