Pierre C schrieb:
Since you have lots of data you can use parallel loading.
Split your data in several files and then do :
CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...
Use a TEMPORARY TABLE for this : you don't need crash-recovery since if
something blows up, you can COPY it again... and it will be much faster
because no WAL will be written.
That's a good advice, thank yo :)
If your disk is fast, COPY is cpu-bound, so if you can do 1 COPY process
per core, and avoid writing WAL, it will scale.
This doesn't solve the other half of your problem (removing the
duplicates) which isn't easy to parallelize, but it will make the COPY
part a lot faster.
Note that you can have 1 core process the INSERT / removing duplicates
while the others are handling COPY and filling temp tables, so if you
pipeline it, you could save some time.
Does your data contain a lot of duplicates, or are they rare ? What
percentage ?
Within the data to import most rows have 20 till 50 duplicates. Sometime
much more, sometimes less.
But over 99,1% of the rows to import are already know. This percentage
is growing, because there is a finite number of rows i want to know.
In my special case i'm collection domain-names. Till now it's completly
for private interests and with normal pc-hardware. I'm collecting them
by crawling known sites and checking them for new hosts. Maybe i will
build later an expired domain service or an reverse ip database or
something like that. But now i'm just interested in the connection of
the sites and the structure people choose domain-names.
(Before someone ask: Till now i have more rows than domains (nearly)
exists, because i collect subdomain of all levels too and do not delete
entries)
Thanks everyone for your advices. This will help me a lot!
Greetings from Germany,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse
auswerten kann.
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance