Tom Lane, 24.07.2012 19:12:
Well, it would only help if you're running a PG version that's new enough to recognize the NOT EXISTS as an anti-join; and even then, it's possible that joining on a tid column forecloses enough plan types that you don't get any real benefit. But I'm just guessing. Can you show exactly what you tried and what EXPLAIN ANALYZE results you got?
I am using 9.1.4 (as I said in my initial post). I finally found a solution that runs fine: DELETE FROM dupes a WHERE EXISTS (SELECT 1 FROM dupes b WHERE b.first_name = a.first_name AND b.last_name = a.last_name AND b.ctid > a.ctid); The execution plan for this is: Delete on public.dupes a (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2419.334..2419.334 rows=0 loops=1) Buffers: shared hit=18029 -> Merge Semi Join (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2043.674..2392.707 rows=17097 loops=1) Output: a.ctid, b.ctid Merge Cond: ((a.first_name = b.first_name) AND (a.last_name = b.last_name)) Join Filter: (b.ctid > a.ctid) Buffers: shared hit=930 -> Sort (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1024.195..1030.051 rows=75000 loops=1) Output: a.ctid, a.first_name, a.last_name Sort Key: a.first_name, a.last_name Sort Method: quicksort Memory: 8870kB Buffers: shared hit=465 -> Seq Scan on public.dupes a (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.025..23.234 rows=75000 loops=1) Output: a.ctid, a.first_name, a.last_name Buffers: shared hit=465 -> Sort (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1019.148..1028.483 rows=105841 loops=1) Output: b.ctid, b.first_name, b.last_name Sort Key: b.first_name, b.last_name Sort Method: quicksort Memory: 8870kB Buffers: shared hit=465 -> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.017..19.133 rows=75000 loops=1) Output: b.ctid, b.first_name, b.last_name Buffers: shared hit=465 Total runtime: 2420.953 ms Which is a lot better than the plan using "WHERE ctid NOT IN (.....)": Delete on public.dupes (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=582515.094..582515.094 rows=0 loops=1) Buffers: shared hit=18027 -> Seq Scan on public.dupes (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=1038.164..582332.927 rows=17097 loops=1) Output: dupes.ctid Filter: (NOT (SubPlan 1)) Buffers: shared hit=930 SubPlan 1 -> Materialize (cost=1777.50..1890.00 rows=7500 width=20) (actual time=0.001..2.283 rows=35552 loops=75000) Output: (min(b.ctid)), b.first_name, b.last_name Buffers: shared hit=465 -> HashAggregate (cost=1777.50..1852.50 rows=7500 width=20) (actual time=90.964..120.228 rows=57903 loops=1) Output: min(b.ctid), b.first_name, b.last_name Buffers: shared hit=465 -> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.008..25.515 rows=75000 loops=1) Output: b.ctid, b.first_name, b.last_name Buffers: shared hit=465 Total runtime: 582517.711 ms Using "WHERE id NOT IN (...)" is the fastest way: Delete on public.dupes (cost=1871.25..3273.75 rows=37500 width=6) (actual time=187.949..187.949 rows=0 loops=1) Buffers: shared hit=18490 -> Seq Scan on public.dupes (cost=1871.25..3273.75 rows=37500 width=6) (actual time=125.351..171.108 rows=17097 loops=1) Output: dupes.ctid Filter: (NOT (hashed SubPlan 1)) Buffers: shared hit=930 SubPlan 1 -> HashAggregate (cost=1777.50..1852.50 rows=7500 width=18) (actual time=73.131..93.421 rows=57903 loops=1) Output: min(b.id), b.first_name, b.last_name Buffers: shared hit=465 -> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000 width=18) (actual time=0.004..8.515 rows=75000 loops=1) Output: b.id, b.first_name, b.last_name Buffers: shared hit=465 Total runtime: 189.222 ms Regards Thomas -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance