Hi, I was testing a query to delete duplicates to see how well using ctid works if the table doesn't have a unique identifier available. The table definition is: create table dupes ( id integer primary key, first_name text, last_name text ); My test table has 100.000 rows with ~13000 being actually unique. The following statement: DELETE FROM dupes WHERE id NOT IN (SELECT min(b.id) FROM dupes b GROUP BY first_name, last_Name HAVING count(*) > 1); produces a quite nice execution plan: Delete on public.dupes (cost=2770.00..4640.00 rows=50000 width=6) (actual time=299.809..299.809 rows=0 loops=1) Buffers: shared hit=88100 -> Seq Scan on public.dupes (cost=2770.00..4640.00 rows=50000 width=6) (actual time=150.113..211.340 rows=86860 loops=1) Output: dupes.ctid Filter: (NOT (hashed SubPlan 1)) Buffers: shared hit=1240 SubPlan 1 -> HashAggregate (cost=2620.00..2745.00 rows=10000 width=18) (actual time=115.739..143.004 rows=13140 loops=1) Output: min(b.id), b.first_name, b.last_name Filter: (count(*) > 1) Buffers: shared hit=620 -> Seq Scan on public.dupes b (cost=0.00..1620.00 rows=100000 width=18) (actual time=0.006..15.563 rows=100000 loops=1) Output: b.id, b.first_name, b.last_name Buffers: shared hit=620 Total runtime: 301.241 ms Now assuming I do not have a unique value in the table. In that case I would revert to using the ctid to identify individual rows: DELETE FROM dupes WHERE ctid NOT IN (SELECT min(b.ctid) FROM dupes b GROUP BY first_name, last_Name HAVING count(*) > 1); Which has a completely different execution plan: Delete on public.dupes (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=269966.623..269966.623 rows=0 loops=1) Buffers: shared hit=88720 -> Seq Scan on public.dupes (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=176.107..269582.651 rows=86860 loops=1) Output: dupes.ctid Filter: (NOT (SubPlan 1)) Buffers: shared hit=1240 SubPlan 1 -> Materialize (cost=2620.00..2795.00 rows=10000 width=20) (actual time=0.002..0.799 rows=12277 loops=100000) Output: (min(b.ctid)), b.first_name, b.last_name Buffers: shared hit=620 -> HashAggregate (cost=2620.00..2745.00 rows=10000 width=20) (actual time=131.162..164.941 rows=13140 loops=1) Output: min(b.ctid), b.first_name, b.last_name Filter: (count(*) > 1) Buffers: shared hit=620 -> Seq Scan on public.dupes b (cost=0.00..1620.00 rows=100000 width=20) (actual time=0.005..29.531 rows=100000 loops=1) Output: b.ctid, b.first_name, b.last_name Buffers: shared hit=620 Total runtime: 269968.515 ms This is Postgres 9.1.4 64bit on Windows 7 Why does the usage of the CTID column change the plan so drastically? Regards Thomas -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance