
Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

On Dec 8, 2014, at 9:35 PM, Scott Marlowe wrote:

> select a,b,c into newtable from oldtable group by a,b,c;
> 
> One pass, done.

This may be a bit naive, but couldn't an approach like the following potentially be faster, depending on the system?

	SELECT a, b, c
	INTO duplicate_records
	FROM (
		SELECT a, b, c, count(*) AS counted
		FROM source_table
		GROUP BY a, b, c
	) q_inner
	WHERE q_inner.counted > 1;

	DELETE FROM source_table
	USING duplicate_records
	WHERE source_table.a = duplicate_records.a
	  AND source_table.b = duplicate_records.b
	  AND source_table.c = duplicate_records.c;

	-- The DELETE removes every copy of each duplicated row, so put a
	-- single copy of each back (assuming a, b, c are the only columns):
	INSERT INTO source_table (a, b, c)
	SELECT a, b, c FROM duplicate_records;

It would require multiple full table scans, but it would minimize writing to disk -- and isn't a read usually much cheaper than a write?  And if duplicates are judged on only a small subset of columns, an index on those columns could speed things up too.
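
For example, something like this (just a sketch -- the index name is made up, and whether it actually helps depends on the data and what the planner decides):

	-- Index on the columns used for duplicate checking; it can let the
	-- GROUP BY and the DELETE's join read the index instead of doing
	-- repeated sequential scans of source_table.
	CREATE INDEX source_table_dedup_idx ON source_table (a, b, c);
	ANALYZE source_table;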








