>> Seems promising, but could you provide me with a reference to the
>> PostgreSQL documentation regarding this "a%8=*" feature?
>> Best
>
> % is the modulus operator.
>
> Assuming "a" is an integer (I don't remember), then doing 8 selects of
> "a modulus 8", one for each of the possible results (0..7), will each
> select about 1/8 of the entire table (I would guess), and put together
> they will end up selecting all of the original table. I don't know,
> myself, why this would be faster. But I'm not any kind of a PostgreSQL
> expert either.

Yes. Extracting unique values from very large sets is not free, neither from
the I/O nor from the CPU point of view; spreading the tasks over more CPUs
should reduce the overall time. (Does your huge table fit in RAM?)

If you don't have integer values available for the % operator, you may also
consider something like hashtext(value) % 8 (see the sketch after the quoted
statements below), but the extra work may result in no benefit. hashtext is
not documented because it is not guaranteed to stay stable across future
Postgres releases, but it is safe in a case like this
(http://lmgtfy.com/?q=postgres+hashtext+partition).

Another point is that I'm not sure that all threads will piggyback on a
shared scan of a freshly created table where the visibility hint bits are
not yet set (see the head comment in
http://doxygen.postgresql.org/syncscan_8c_source.html), because reading a
table for the first time implies rewriting it:
http://www.cybertec.at/speeding-up-things-with-hint-bits/

You'll be able to avoid this extra I/O in the upcoming 9.4 thanks to the new
COPY FREEZE option:
http://www.postgresql.org/docs/9.4/interactive/sql-copy.html

Hmmm, on the other hand, I suppose that you would avoid the extra I/O for
the hint bits if you first copied your data into a temp table, but then you
wouldn't be able to parallelize the job, as other transactions won't see the
data.

Moreover, you need to pay attention to how much work_mem you can afford for
each transaction, knowing that you have x of them running concurrently.

So in the end I'm not sure whether multiple threads will help here. I'm
using this approach for aggregations, which are more CPU-intensive than a
simple DISTINCT.

I'm looking forward to seeing your test results :)

Marc Mamin

> Daniel
>
> -----Original Message-----
> From: pgsql-general-owner@xxxxxxxxxxxxxx
> [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Marc Mamin
> Sent: December-12-14 06:41
> To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
> Cc: 'Andy Colson'; pgsql-general@xxxxxxxxxxxxxx
> Subject: Re: Removing duplicate records from a bulk upload
> (rationale behind selecting a method)
>
>> Thank Tom,
>> I understand that the rationale behind choosing to create a new table
>> from distinct records is that, since both approaches need full table
>> scans, selecting distinct records is faster (and seems more
>> straightforward) than finding/deleting duplicates;
>
> Hi,
> On a large table you may get it faster while using more than one thread,
> e.g.:
>
> select a,b,c into newtable from oldtable where a%8 = 0 group by a,b,c;
> select a,b,c into newtable from oldtable where a%8 = 1 group by a,b,c;
> ...
> select a,b,c into newtable from oldtable where a%8 = 7 group by a,b,c;
>
> This will/should use a shared full table scan on oldtable.
>
> HTH
>
> Marc Mamin
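For illustration only, here is a minimal sketch of that hashtext variant, for
the case where the dedup key "a" is text rather than an integer. All object
names (oldtable, newtable_0 .. newtable_7, the columns a, b, c) are just
placeholders, each statement is meant to run in its own connection, and
hashtext() is undocumented, so treat this as a sketch rather than a recipe:

  -- one statement per session, N = 0 .. 7; (hashtext(a) & 7) gives a 0..7
  -- bucket and, unlike %, never goes negative for negative hash values
  select a, b, c
    into newtable_0
    from oldtable
   where (hashtext(a) & 7) = 0
   group by a, b, c;

  -- ... same statement with "= 1" through "= 7" in the other seven sessions ...

  -- once all sessions have finished, stitch the parts back together; a given
  -- value of "a" always hashes to the same bucket, so no duplicate can span
  -- two partial tables
  create table newtable as
        select * from newtable_0
  union all select * from newtable_1
  union all select * from newtable_2
  union all select * from newtable_3
  union all select * from newtable_4
  union all select * from newtable_5
  union all select * from newtable_6
  union all select * from newtable_7;

Whether this beats a single-threaded pass depends on how CPU-bound the
de-duplication really is; the hashing itself is exactly the extra work
mentioned above that may eat the benefit.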
>> Best regards,
>> Daniel
>>
>> -----Original Message-----
>> From: pgsql-general-owner@xxxxxxxxxxxxxx
>> [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Tom Lane
>> Sent: December-08-14 21:52
>> To: Scott Marlowe
>> Cc: Andy Colson; Daniel Begin; pgsql-general@xxxxxxxxxxxxxx
>> Subject: Re: Removing duplicate records from a bulk upload
>> (rationale behind selecting a method)
>>
>> Scott Marlowe <scott.marlowe@xxxxxxxxx> writes:
>>> If you're de-duping a whole table, no need to create indexes, as it's
>>> gonna have to hit every row anyway. Fastest way I've found has been:
>>>
>>> select a,b,c into newtable from oldtable group by a,b,c;
>>>
>>> One pass, done.
>>>
>>> If you want to use less than the whole row, you can use
>>> select distinct on (col1, col2) * into newtable from oldtable;
>>
>> Also, the DISTINCT ON method can be refined to control which of a set
>> of duplicate keys is retained, if you can identify additional columns
>> that constitute a preference order for retaining/discarding dupes. See
>> the "latest weather reports" example in the SELECT reference page.
>>
>> In any case, it's advisable to crank up work_mem while performing this
>> operation.
>>
>> regards, tom lane
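For illustration, Tom's DISTINCT ON refinement could look like the following.
The table and the tie-breaking column (updated_at) are made-up names; the
point is that the ORDER BY columns after the DISTINCT ON key decide which of
the duplicate rows survives, as in the "latest weather reports" example in
the SELECT documentation:

  -- give this session more sort/hash memory; the value is only an example
  set work_mem = '1GB';

  -- keep, for each (col1, col2) pair, the row with the most recent updated_at
  select distinct on (col1, col2) *
    into newtable
    from oldtable
   order by col1, col2, updated_at desc;

Without the trailing "updated_at desc", which of the duplicate rows is kept
would be unpredictable.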