Re: How to insert a bulk of data with unique-violations very fast

Torsten Zühlsdorff <foo@xxxxxxxxxxxxxxxxxxx> · Mon, 07 Jun 2010 15:21:13 +0200

Pierre C schrieb:
Since you have lots of data you can use parallel loading.

Split your data in several files and then do :

CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...

Use a TEMPORARY TABLE for this : you don't need crash-recovery since if 
something blows up, you can COPY it again... and it will be much faster 
because no WAL will be written.

That's a good advice, thank yo :)

If your disk is fast, COPY is cpu-bound, so if you can do 1 COPY process 
per core, and avoid writing WAL, it will scale.

This doesn't solve the other half of your problem (removing the 
duplicates) which isn't easy to parallelize, but it will make the COPY 
part a lot faster.

Note that you can have 1 core process the INSERT / removing duplicates 
while the others are handling COPY and filling temp tables, so if you 
pipeline it, you could save some time.

Does your data contain a lot of duplicates, or are they rare ? What 
percentage ?

Within the data to import most rows have 20 till 50 duplicates. Sometime 
much more, sometimes less.

But over 99,1% of the rows to import are already know. This percentage 
is growing, because there is a finite number of rows i want to know.

In my special case i'm collection domain-names. Till now it's completly 
for private interests and with normal pc-hardware. I'm collecting them 
by crawling known sites and checking them for new hosts. Maybe i will 
build later an expired domain service or an reverse ip database or 
something like that. But now i'm just interested in the connection of 
the sites and the structure people choose domain-names.

(Before someone ask: Till now i have more rows than domains (nearly) 
exists, because i collect subdomain of all levels too and do not delete 
entries)

Thanks everyone for your advices. This will help me a lot!

Greetings from Germany,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8 
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse 
auswerten kann.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance