Thanks Tom, I understand that the rationale behind choosing to create a new table from distinct records is that, since both approaches need full table scans, selecting distinct records is faster (and more straightforward) than finding and deleting duplicates.

Best regards,
Daniel

-----Original Message-----
From: pgsql-general-owner@xxxxxxxxxxxxxx [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Tom Lane
Sent: December-08-14 21:52
To: Scott Marlowe
Cc: Andy Colson; Daniel Begin; pgsql-general@xxxxxxxxxxxxxx
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

Scott Marlowe <scott.marlowe@xxxxxxxxx> writes:
> If you're de-duping a whole table, no need to create indexes, as it's
> gonna have to hit every row anyway. Fastest way I've found has been:
>
> select a,b,c into newtable from oldtable group by a,b,c;
>
> One pass, done.
>
> If you want to use less than the whole row, you can use select
> distinct on (col1, col2) * into newtable from oldtable;

Also, the DISTINCT ON method can be refined to control which of a set of duplicate keys is retained, if you can identify additional columns that constitute a preference order for retaining/discarding dupes.  See the "latest weather reports" example in the SELECT reference page.

In any case, it's advisable to crank up work_mem while performing this operation.

			regards, tom lane
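
[Archive note: below is a minimal sketch of the refinement Tom describes, combining DISTINCT ON with an ORDER BY that encodes a retention preference. The names oldtable, newtable, col1, col2, and updated_at, as well as the work_mem value, are illustrative assumptions, not taken from the thread.]

    -- Raise work_mem for this session so the sort/hash step stays in memory
    -- (the value is an assumption; tune it to the available RAM).
    SET work_mem = '256MB';

    -- DISTINCT ON keeps the first row of each (col1, col2) group in ORDER BY
    -- order, so the trailing "updated_at DESC" makes it retain the newest
    -- duplicate, mirroring the "latest weather reports" example in the
    -- PostgreSQL SELECT reference page.
    SELECT DISTINCT ON (col1, col2) *
    INTO newtable
    FROM oldtable
    ORDER BY col1, col2, updated_at DESC;

Note that the leading ORDER BY expressions must match the DISTINCT ON expressions; only the trailing columns are free to express which duplicate to keep.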