On Mon, Mar 8, 2010 at 10:16 AM, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:
> On Sun, Mar 7, 2010 at 11:31 PM, Allan Kamau <kamauallan@xxxxxxxxx> wrote:
>> On Mon, Mar 8, 2010 at 5:49 AM, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:
>>> On Sun, Mar 7, 2010 at 1:45 AM, Allan Kamau <kamauallan@xxxxxxxxx> wrote:
>>>> Hi,
>>>> I am looking for an efficient and effective solution to eliminate
>>>> duplicates in a continuously updated "cumulative" transaction table
>>>> (no deletions are envisioned as all non-redundant records are
>>>> important). Below is my situation.
>>>
>>> Is there a reason you can't use a unique index and detect failed
>>> inserts and reject them?
>>>
>>
>> I think it would have been possible to make use of a unique index as
>> you have suggested, and silently trap the uniqueness violation.
>>
>> But in my case (as pointed out in my previous lengthy mail) I am
>> inserting multiple records at once, which implicitly means a single
>> transaction. I think in this scenario a violation of uniqueness by
>> even a single record will lead to all the other records (in this
>> batch) being rejected as well.
>
> There are a whole bunch of approaches to this.  You're basically bulk
> loading data into a table that already has data in it.  I'd put
> everything into a temp table then insert into main where not exists in
> temp table.  And I'd still have that unique index on my pk in the main
> table.  First maintain uniqueness, then figure out how to load data
> into it.  insert into where not exists is quite fast, and it's all one
> big transaction that either goes or doesn't, so there's no cleaning up
> broken / extra records later.
>
> Are you looking at upserting these inputs? Or is the original row good enough?
>
>>
>> Is there perhaps a way to single out only the unique constraint
>> violating record(s) without having to perform individual record
>> inserts? I am following the example found here:
>> "http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING".
>>
>> Allan.
>>
>
> --
> When fascism comes to America, it will be intolerance sold as diversity.
>

The original row is good enough; the new row should not replace a
similar but already existing record.

I will now look at evaluating (for performance):

1)
INSERT INTO main_table (...)
SELECT a...
FROM mytmp_table a
WHERE a.supposedly_unique_field NOT IN
  (SELECT b.supposedly_unique_field FROM main_table b);

2)
INSERT INTO main_table (...)
SELECT a...
FROM mytmp_table a
LEFT JOIN main_table b
  ON b.supposedly_unique_field = a.supposedly_unique_field
WHERE b.id IS NULL;

But since more than one client is likely to be running such an insert
at the same time, a transaction that commits after other transactions
have already inserted data into the main table may still attempt to
insert duplicate records (which would trigger an error if a unique
constraint were in effect on the "main" table). I am therefore opting
not to have a unique constraint on this field, but rather simply an
index. Then I will make use of a boolean (initially NULL) field in the
main table, which I will update appropriately once all the
data-loading clients have finished (a cleanup step).
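For concreteness, below is a rough sketch of the schema I have in mind
for this. The column types, the "payload" column (standing in for the
fields elided as (...) above) and the choice of a temp table for the
staging table are only placeholders/guesses to make the statements
runnable:

CREATE TABLE main_table (
    id                      serial PRIMARY KEY,
    supposedly_unique_field text,
    payload                 text,    -- stands in for the columns elided as (...) above
    is_redundant            boolean  -- initially NULL, set to TRUE in the cleanup step
);

-- a plain (non-unique) index, so concurrent loaders never hit a
-- uniqueness error
CREATE INDEX main_table_suf_idx ON main_table (supposedly_unique_field);

-- per-client staging table holding the batch to be merged
CREATE TEMP TABLE mytmp_table (
    id                      serial PRIMARY KEY,
    supposedly_unique_field text,
    payload                 text
);

For the performance comparison I would simply time the two INSERT
statements against a representative batch (\timing in psql), or run
them under EXPLAIN ANALYZE inside a transaction that is then rolled
back, since EXPLAIN ANALYZE actually executes the INSERT.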
The update to this "is_redundant" field could be via "UPDATE main_table b SET is_redundant=TRUE FROM (SELECT min(id)AS id___min,a.supposedly_unique_field FROM main_table a GROUP BY a.supposedly_unique_field HAVING count(*)>1)a WHERE b.supposedly_unique_field=a.supposedly_unique_field AND b.id>a.id___min; Allan. -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general