>Thank Tom, >I understand that the rationale behind choosing to create a new table from >distinct records is that, since both approaches need full table scans, >selecting distinct records is faster (and seems more straight forward) than >finding/deleting duplicates; Hi, on a large table you may get it faster while using more than one thread. e.g.: select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c; select a,b,c into newtable from oldtable where a%8 =1 group by a,b,c; ... select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c; This will/should use a shared full table scan on oldtable. HTH Marc Mamin > >Best regards, >Daniel > >-----Original Message----- >From: pgsql-general-owner@xxxxxxxxxxxxxx >[mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Tom Lane >Sent: December-08-14 21:52 >To: Scott Marlowe >Cc: Andy Colson; Daniel Begin; pgsql-general@xxxxxxxxxxxxxx >Subject: Re: Removing duplicate records from a bulk upload >(rationale behind selecting a method) > >Scott Marlowe <scott.marlowe@xxxxxxxxx> writes: >> If you're de-duping a whole table, no need to create indexes, as it's >> gonna have to hit every row anyway. Fastest way I've found has been: > >> select a,b,c into newtable from oldtable group by a,b,c; > >> On pass, done. > >> If you want to use less than the whole row, you can use select >> distinct on (col1, col2) * into newtable from oldtable; > >Also, the DISTINCT ON method can be refined to control which of a set of >duplicate keys is retained, if you can identify additional columns that >constitute a preference order for retaining/discarding dupes. See the >"latest weather reports" example in the SELECT reference page. > >In any case, it's advisable to crank up work_mem while performing this >operation. > > regards, tom lane > > >-- >Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make >changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-general > > > >-- >Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) >To make changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-general > -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general