Pierre C wrote:
Within the data to import, most rows have 20 to 50 duplicates.
Sometimes much more, sometimes less.
In that case (the source data has lots of redundancy), after importing the
data chunks in parallel, you can run a first pass of de-duplication on
the chunks, also in parallel, with something like:
CREATE TEMP TABLE foo_1_dedup AS SELECT DISTINCT * FROM foo_1;
Or you could compute some aggregates, counts, etc. As before, no WAL is
needed, and you can use all your cores in parallel; for instance:
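(col_a below is a made-up name standing in for whatever column(s)
identify a duplicate.)

-- Per-chunk pre-aggregation in a TEMP table: no WAL is written, and
-- each chunk can be handled by its own session/core.
CREATE TEMP TABLE foo_1_agg AS
SELECT col_a, count(*) AS dup_count
FROM foo_1
GROUP BY col_a;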
From what you say, this should reduce the size of your imported data by
a lot (and hence the time spent in the non-parallel operation).
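For completeness, the final non-parallel step could then look something
like the sketch below (table names foo, foo_1_dedup and foo_2_dedup are
assumed; note that a TEMP table is only visible to the session that
created it, so deduplicated chunks that must survive into the merge
session would have to be regular tables):

-- Final, single-session pass: combine the already-reduced chunks and
-- drop the duplicates that span chunk boundaries.
INSERT INTO foo
SELECT DISTINCT *
FROM (
    SELECT * FROM foo_1_dedup
    UNION ALL
    SELECT * FROM foo_2_dedup
    -- ... one UNION ALL branch per chunk ...
) AS merged;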
Thank you very much for this advice. I've tried it in another project
with similar import problems. It really sped up the import.
Thanks to everyone for your time and help!
Greetings,
Torsten
--
http://www.dddbl.de - a database layer that abstracts working with 8
different database systems, separates queries from the application, and
can evaluate query results automatically.