Search Postgresql Archives

Re: Removing duplicate records from a bulk upload

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 12/7/2014 9:31 PM, Daniel Begin wrote:
I have just completed the bulk upload of a large database. Some tables
have billions of records and no constraints or indexes have been applied
yet. About 0.1% of these records may have been duplicated during the
upload and I need to remove them before applying constraints.

I understand there are (at least) two approaches to get a table without
duplicate records…

-           Delete duplicate records from the table based on an
appropriate select clause;

-           Create a new table with the results from a select distinct
clause, and then drop the original table.

What would be the most efficient procedure in PostgreSQL to do the job
considering …

-           I do not know which records were duplicated;

-           There are no indexes applied on tables yet;

-           There is no OIDS on tables yet;

-           The database is currently 1TB but I have plenty of disk space.

Daniel


How would you detect duplicate? Is there a single field that would be duplicated? Or do you have to test a bunch of different fields?

If its a single field, you could find dups in a single pass of the table with:

create index bigtable_key on bigtable(key);
select key, count(*) from bigtable group by key having count(*) > 1;

Save that list, and decide on some way of deleting the dups.

The index might help the initial select, but will really help re-query and delete statements.

-Andy


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux