Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

John McKown <john.archie.mckown@xxxxxxxxx> · Fri, 12 Dec 2014 10:59:24 -0600

On Fri, Dec 12, 2014 at 9:57 AM, Daniel Begin <jfd553@xxxxxxxxxxx> wrote:
Seems promising but could you provide me a reference to PostgreSQL

documentation regarding this "a%8=*" feature?

Best

% is the modulus operator. 

Assuming "a" is an integer (I don't remember), then doing 8 selects of "a modulus 8" = for each of the possible results (0..7) will each select about 1/8 of the entire table (I would guess) and the end result put together, they will end up selecting all of the original table. I don't know, myself, why this would be faster. But I'm not any kind of a PostgreSQL expert either.

Daniel

-----Original Message-----

From: pgsql-general-owner@xxxxxxxxxxxxxx

[mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Marc Mamin

Sent: December-12-14 06:41

To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'

Cc: 'Andy Colson'; pgsql-general@xxxxxxxxxxxxxx

Subject: Re:  Removing duplicate records from a bulk upload

(rationale behind selecting a method)

>Thank Tom,

>I understand that the rationale behind choosing to create a new table

>from distinct records is that, since both approaches need full table

>scans, selecting distinct records is faster (and seems more straight

>forward) than finding/deleting duplicates;

Hi,

on a large table you may get it faster while using more than one thread.

e.g.:

select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c; select

a,b,c into newtable from oldtable where a%8 =1 group by a,b,c; ...

select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;

This will/should use a shared full table scan on oldtable.

HTH

Marc Mamin

>

>Best regards,

>Daniel

>

>-----Original Message-----

>From: pgsql-general-owner@xxxxxxxxxxxxxx

>[mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Tom Lane

>Sent: December-08-14 21:52

>To: Scott Marlowe

>Cc: Andy Colson; Daniel Begin; pgsql-general@xxxxxxxxxxxxxx

>Subject: Re:  Removing duplicate records from a bulk upload

>(rationale behind selecting a method)

>

>Scott Marlowe <scott.marlowe@xxxxxxxxx> writes:

>> If you're de-duping a whole table, no need to create indexes, as it's

>> gonna have to hit every row anyway. Fastest way I've found has been:

>

>> select a,b,c into newtable from oldtable group by a,b,c;

>

>> On pass, done.

>

>> If you want to use less than the whole row, you can use select

>> distinct on (col1, col2) * into newtable from oldtable;

>

>Also, the DISTINCT ON method can be refined to control which of a set

>of duplicate keys is retained, if you can identify additional columns

>that constitute a preference order for retaining/discarding dupes.  See

>the "latest weather reports" example in the SELECT reference page.

>

>In any case, it's advisable to crank up work_mem while performing this

>operation.

>

>                        regards, tom lane

>

>

>--

>Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To

>make changes to your subscription:

>http://www.postgresql.org/mailpref/pgsql-general

>

>

>

>--

>Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To

>make changes to your subscription:

>http://www.postgresql.org/mailpref/pgsql-general

>

--

Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make

changes to your subscription:

http://www.postgresql.org/mailpref/pgsql-general

--

Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)

To make changes to your subscription:

http://www.postgresql.org/mailpref/pgsql-general

-- 
While a transcendent vocabulary is laudable, one must be eternally careful so that the calculated objective of communication does not become ensconced in obscurity.  In other words, eschew obfuscation.

Maranatha! <><
John McKown