>> Seems promising, but could you provide me with a reference to the
>> PostgreSQL documentation regarding this "a%8=*" feature?
>> Best
>
> % is the modulus operator.
>
> Assuming "a" is an integer (I don't remember), then doing 8 selects of
> "a modulus 8", one for each of the possible results (0..7), will each
> select about 1/8 of the entire table (I would guess), and put together
> they will end up selecting all of the original table. I don't know,
> myself, why this would be faster. But I'm not any kind of a PostgreSQL
> expert either.

Yes. Extracting unique values from very large sets is not free, neither from
the I/O nor from the CPU point of view; spreading the tasks over more CPUs
should reduce the overall time. (Does your huge table fit in RAM?)

If you don't have integer values available for the % operator, you may also
consider something like hashtext(value) % 8 (see the sketch after the quoted
statements below), but the extra work may result in no benefit. hashtext is
not documented because it is not guaranteed to stay stable across future
Postgres releases, but it is safe in a case like this
(http://lmgtfy.com/?q=postgres+hashtext+partition).

Another point is that I'm not sure that all threads will piggyback on a
shared scan of a freshly created table where the visibility hint bits are
not yet set (see the head comment in
http://doxygen.postgresql.org/syncscan_8c_source.html), because reading a
table for the first time implies rewriting it:
http://www.cybertec.at/speeding-up-things-with-hint-bits/

You'll be able to avoid this extra I/O in the upcoming 9.4 thanks to the new
COPY FREEZE option:
http://www.postgresql.org/docs/9.4/interactive/sql-copy.html

Hmmm, on the other hand, I suppose that you would avoid the extra I/O for
the hint bits if you first copied your data into a temp table, but then you
wouldn't be able to parallelize the job, as other transactions won't see the
data.

Moreover, you need to pay attention to how much work_mem you can afford for
each transaction, knowing that you have x of them running concurrently.

So in the end I'm not sure whether multiple threads will help here. I'm
using this approach for aggregations, which are more CPU-intensive than a
simple DISTINCT.

I'm looking forward to seeing your test results :)

Marc Mamin

> Daniel
>
> -----Original Message-----
> From: pgsql-general-owner@xxxxxxxxxxxxxx
> [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Marc Mamin
> Sent: December-12-14 06:41
> To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
> Cc: 'Andy Colson'; pgsql-general@xxxxxxxxxxxxxx
> Subject: Re: Removing duplicate records from a bulk upload
> (rationale behind selecting a method)
>
>> Thank Tom,
>> I understand that the rationale behind choosing to create a new table
>> from distinct records is that, since both approaches need full table
>> scans, selecting distinct records is faster (and seems more
>> straightforward) than finding/deleting duplicates;
>
> Hi,
> On a large table you may get it faster while using more than one thread,
> e.g.:
>
> select a,b,c into newtable from oldtable where a%8 = 0 group by a,b,c;
> select a,b,c into newtable from oldtable where a%8 = 1 group by a,b,c;
> ...
> select a,b,c into newtable from oldtable where a%8 = 7 group by a,b,c;
>
> This will/should use a shared full table scan on oldtable.
>
> HTH
>
> Marc Mamin
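For illustration only, here is a minimal sketch of that hashtext variant, for
the case where the dedup key "a" is text rather than an integer. All object
names (oldtable, newtable_0 .. newtable_7, the columns a, b, c) are just
placeholders, each statement is meant to run in its own connection, and
hashtext() is undocumented, so treat this as a sketch rather than a recipe:

  -- one statement per session, N = 0 .. 7; (hashtext(a) & 7) gives a 0..7
  -- bucket and, unlike %, never goes negative for negative hash values
  select a, b, c
    into newtable_0
    from oldtable
   where (hashtext(a) & 7) = 0
   group by a, b, c;

  -- ... same statement with "= 1" through "= 7" in the other seven sessions ...

  -- once all sessions have finished, stitch the parts back together; a given
  -- value of "a" always hashes to the same bucket, so no duplicate can span
  -- two partial tables
  create table newtable as
        select * from newtable_0
  union all select * from newtable_1
  union all select * from newtable_2
  union all select * from newtable_3
  union all select * from newtable_4
  union all select * from newtable_5
  union all select * from newtable_6
  union all select * from newtable_7;

Whether this beats a single-threaded pass depends on how CPU-bound the
de-duplication really is; the hashing itself is exactly the extra work
mentioned above that may eat the benefit.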
>> Best regards,
>> Daniel
>>
>> -----Original Message-----
>> From: pgsql-general-owner@xxxxxxxxxxxxxx
>> [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Tom Lane
>> Sent: December-08-14 21:52
>> To: Scott Marlowe
>> Cc: Andy Colson; Daniel Begin; pgsql-general@xxxxxxxxxxxxxx
>> Subject: Re: Removing duplicate records from a bulk upload
>> (rationale behind selecting a method)
>>
>> Scott Marlowe <scott.marlowe@xxxxxxxxx> writes:
>>> If you're de-duping a whole table, no need to create indexes, as it's
>>> gonna have to hit every row anyway. Fastest way I've found has been:
>>>
>>> select a,b,c into newtable from oldtable group by a,b,c;
>>>
>>> One pass, done.
>>>
>>> If you want to use less than the whole row, you can use
>>> select distinct on (col1, col2) * into newtable from oldtable;
>>
>> Also, the DISTINCT ON method can be refined to control which of a set
>> of duplicate keys is retained, if you can identify additional columns
>> that constitute a preference order for retaining/discarding dupes. See
>> the "latest weather reports" example in the SELECT reference page.
>>
>> In any case, it's advisable to crank up work_mem while performing this
>> operation.
>>
>> regards, tom lane
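For illustration, Tom's DISTINCT ON refinement could look like the following.
The table and the tie-breaking column (updated_at) are made-up names; the
point is that the ORDER BY columns after the DISTINCT ON key decide which of
the duplicate rows survives, as in the "latest weather reports" example in
the SELECT documentation:

  -- give this session more sort/hash memory; the value is only an example
  set work_mem = '1GB';

  -- keep, for each (col1, col2) pair, the row with the most recent updated_at
  select distinct on (col1, col2) *
    into newtable
    from oldtable
   order by col1, col2, updated_at desc;

Without the trailing "updated_at desc", which of the duplicate rows is kept
would be unpredictable.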