Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

On 12/08/2014 03:59 PM, Daniel Begin wrote:
Thanks for your answers, Andy; I will keep in mind the procedure you proposed.
About the fields required to find duplicate records: all of them are needed (5 to 9, depending on the table).

Considering that the tables are not indexed yet, am I right to think that both approaches will need a full table scan?
- Deleting duplicate records would need a full table scan to create temporary indexes to select/remove duplicate records;
- Inserting distinct records into an empty table will also need a full table scan to select distinct (*) from big_table;

That said, is indexing and then selecting/deleting duplicate records faster than rewriting the whole table from distinct records? I am trying to find a rationale behind the choice - execution time, simplicity? I just don't know what the criteria should be, or how and why they would affect the selection of an approach :-|
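For concreteness, the "insert distinct records into an empty table" route could be written like this (a sketch; big_table is the placeholder name used in the thread, and SELECT DISTINCT compares all output columns, which matches the case where every field defines a duplicate):

```sql
-- Build a deduplicated copy in one sequential pass, then swap it in.
-- This rewrites the whole table but needs no pre-existing index.
CREATE TABLE big_table_dedup AS
SELECT DISTINCT * FROM big_table;

DROP TABLE big_table;
ALTER TABLE big_table_dedup RENAME TO big_table;
```

With plenty of disk space this trades extra I/O for simplicity; the in-place deletion route avoids rewriting the ~99.9% of rows that are unique, but first needs some way to identify the dups.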

Daniel

-----Original Message-----
From: pgsql-general-owner@xxxxxxxxxxxxxx [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Andy Colson
Sent: December-08-14 11:39
To: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  Removing duplicate records from a bulk upload

On 12/8/2014 10:30 AM, Andy Colson wrote:
On 12/7/2014 9:31 PM, Daniel Begin wrote:
I have just completed the bulk upload of a large database. Some
tables have billions of records and no constraints or indexes have
been applied yet. About 0.1% of these records may have been
duplicated during the upload and I need to remove them before applying constraints.

I understand there are (at least) two approaches to get a table
without duplicate records…

-           Delete duplicate records from the table based on an
appropriate select clause;

-           Create a new table with the results from a select distinct
clause, and then drop the original table.

What would be the most efficient procedure in PostgreSQL to do the
job considering …

-           I do not know which records were duplicated;

-           There are no indexes applied on tables yet;

-           There is no OIDS on tables yet;

-           The database is currently 1TB but I have plenty of disk
space.

Daniel


How would you detect a duplicate?  Is there a single field that would be
duplicated, or do you have to test a bunch of different fields?

If it's a single field, you could find dups in a single pass of the table with:

create index bigtable_key on bigtable(key);

select key, count(*)
from bigtable
group by key
having count(*) > 1;

Save that list, and decide on some way of deleting the dups.

The index might help the initial select, but it will really help the re-query
and delete statements.
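For example, a delete that keeps one copy per key could look like this (a sketch; it assumes the single key column above, and relies on ctid, PostgreSQL's physical row identifier, which is usable here because the table has no unique key yet):

```sql
-- Keep the first physical copy (lowest ctid) of each key, delete the rest.
-- NOT IN over a huge subquery result can be slow on billions of rows.
DELETE FROM bigtable
WHERE ctid NOT IN (
    SELECT min(ctid)
    FROM bigtable
    GROUP BY key
);
```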

-Andy



I just thought of a more generic way.

1) make a non-unique index on bigtable
2) make a temp table
3) -- copy only dups
insert into temp
select * from bigtable
where key in (select key from bigtable
              group by key
              having count(*) > 1);

4)
delete from bigtable where key in (select key from temp);

5)
insert into bigtable
select distinct * from temp;


This would minimize the amount of data you have to move around.  It depends on how hard step 3 is to write.  The index is not required, but it would make both steps 3 and 4 faster.

-Andy


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




I was assuming you could find dups using a single field.  That field could be indexed, and then used (via the index) both to find the dups and to delete rows.

If you have to use all the columns ... well, I'd ignore everything I said. :-)
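Even with all the columns in play, a variant of the same idea still works. A sketch, with col1 and col2 standing in for the real 5-9 columns (note that plain = never matches NULLs, so use IS NOT DISTINCT FROM instead for any nullable column):

```sql
-- Keep the first physical copy (lowest ctid) of each group of identical
-- rows and delete the rest, grouping on every column.
DELETE FROM bigtable a
USING (
    SELECT min(ctid) AS keep_ctid, col1, col2
    FROM bigtable
    GROUP BY col1, col2
) b
WHERE a.col1 = b.col1
  AND a.col2 = b.col2   -- ...and so on for each remaining column
  AND a.ctid <> b.keep_ctid;
```

This does one full-table scan plus a grouped aggregate, without rewriting the unique rows.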

-Andy


