Re: How to insert a bulk of data with unique-violations very fast

Andy Colson <andy@xxxxxxxxxxxxxxx> · Sun, 06 Jun 2010 07:45:34 -0500

On 06/01/2010 10:03 AM, Torsten Zühlsdorff wrote:
Hello,

i have a set of unique data which about 150.000.000 rows. Regullary i
get a list of data, which contains multiple times of rows than the
already stored one. Often around 2.000.000.000 rows. Within this rows
are many duplicates and often the set of already stored data.
I want to store just every entry, which is not within the already stored
one. Also i do not want to store duplicates. Example:

Already stored set:
a,b,c

Given set:
a,b,a,c,d,a,c,d,b

Expected set after import:
a,b,c,d

I now looking for a faster way for the import. At the moment i import
the new data with copy into an table 'import'. then i remove the
duplicates and insert every row which is not already known. after that
import is truncated.

Is there a faster way? Should i just insert every row and ignore it, if
the unique constrain fails?

Here the simplified table-schema. in real life it's with partitions:
test=# \d urls
Tabelle »public.urls«
Spalte | Typ | Attribute
--------+---------+-------------------------------------------------------
url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
url | text | not null
Indexe:
»urls_url« UNIQUE, btree (url)
»urls_url_id« btree (url_id)

Thanks for every hint or advice! :)

Greetings from Germany,
Torsten

I do this with a stored procedure.  I do not care about speed because my db is really small and I only insert a few records a month.  So I dont know how fast this is, but here is my func:

CREATE FUNCTION addentry(idate timestamp without time zone, ilevel integer) RETURNS character varying
AS $$
declare
    tmp integer;
begin
    insert into blood(adate, alevel) values(idate, ilevel);
    return 'ok';
exception
    when unique_violation then
        select into tmp alevel from blood where adate = idate;
        if tmp <> ilevel then
            return idate || ' levels differ!';
        else
            return 'ok, already in table';
        end if;
end; $$
LANGUAGE plpgsql;

Use it like, select * from addentry('2010-006-06 8:00:00', 130);

I do an extra check that if the date's match that the level's match too, but you wouldnt have to.  There is a unique index on adate.

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance