Re: How to insert a bulk of data with unique-violations very fast

Torsten Zühlsdorff <foo@xxxxxxxxxxxxxxxxxxx> · Sun, 06 Jun 2010 14:02:20 +0200

Scott Marlowe schrieb:

i have a set of unique data which about 150.000.000 rows. Regullary i get
a
list of data, which contains multiple times of rows than the already
stored
one. Often around 2.000.000.000 rows. Within this rows are many
duplicates
and often the set of already stored data.
I want to store just every entry, which is not within the already stored
one. Also i do not want to store duplicates. Example:
The standard method in pgsql is to load the data into a temp table
then insert where not exists in old table.
Sorry, i didn't get it. I've googled some examples, but no one match at my
case. Every example i found was a single insert which should be done or
ignored, if the row is already stored.

But in my case i have a bulk of rows with duplicates. Either your tipp
doesn't match my case or i didn't unterstand it correctly. Can you provide a
simple example?

create table main (id int primary key, info text);
create table loader (id int, info text);
insert into main values (1,'abc'),(2,'def'),(3,'ghi');
insert into loader values (1,'abc'),(4,'xyz');
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
(3 rows)

select * from loader;
 id | info
----+------
  1 | abc
  4 | xyz
(2 rows)

insert into main select * from loader except select * from main;
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
  4 | xyz
(4 rows)

Note that for the where not exists to work the fields would need to be
all the same, or you'd need a more complex query.  If the info field
here was different you'd get an error an no insert / update.  For that
case you might want to use "where not in":

insert into main select * from loader where id not in (select id from main);

Thank you very much for your example. Now i've got it :)

I've test your example on a small set of my rows. While testing i've 
stumpled over a difference in sql-formulation. Using except seems to be 
a little slower than the more complex where not in (subquery) group by. 
Here is my example:

CREATE TABLE tseq (value text);
INSERT INTO tseq VALUES ('a') , ('b'), ('c');
CREATE UNIQUE INDEX tseq_unique on tseq (value);
CREATE TEMP TABLE tmpseq(value text);
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('d');
INSERT INTO tmpseq VALUES ('d') , ('b'), ('d');
SELECT* from tseq;
 value
-------
 a
 b
 c
(3 rows)

SELECT* from tmpseq;
 value
-------
 a
 b
 c
 a
 b
 c
 a
 b
 d
 d
 b
 d
(12 rows)

VACUUM VERBOSE ANALYSE;

explain analyze SELECT value FROM tmpseq except SELECT value FROM tseq;
                                                      QUERY PLAN 

----------------------------------------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..2.34 rows=4 width=2) (actual 
time=0.157..0.158 rows=1 loops=1)
   ->  Append  (cost=0.00..2.30 rows=15 width=2) (actual 
time=0.012..0.126 rows=15 loops=1)
         ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1.24 rows=12 
width=2) (actual time=0.009..0.060 rows=12 loops=1)
               ->  Seq Scan on tmpseq  (cost=0.00..1.12 rows=12 
width=2) (actual time=0.004..0.022 rows=12 loops=1)
         ->  Subquery Scan "*SELECT* 2"  (cost=0.00..1.06 rows=3 
width=2) (actual time=0.006..0.018 rows=3 loops=1)
               ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) 
(actual time=0.003..0.009 rows=3 loops=1)
 Total runtime: 0.216 ms
(7 rows)

explain analyze SELECT value FROM tmpseq WHERE value NOT IN (SELECT 
value FROM tseq) GROUP BY value;
                                                 QUERY PLAN 

------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2.20..2.22 rows=2 width=2) (actual 
time=0.053..0.055 rows=1 loops=1)
   ->  Seq Scan on tmpseq  (cost=1.04..2.19 rows=6 width=2) (actual 
time=0.038..0.043 rows=3 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) 
(actual time=0.004..0.009 rows=3 loops=1)
 Total runtime: 0.105 ms
(6 rows)

My question: is this an generall behavior or just an effect of the small 
case?

Greetings form Germany,
Torsten

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance