Re: How to insert a bulk of data with unique-violations very fast


Pierre C wrote:
Since you have lots of data you can use parallel loading.

Split your data into several files and then do:

CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...

Use a TEMPORARY TABLE for this: you don't need crash-recovery since if something blows up, you can COPY it again... and it will be much faster because no WAL will be written.
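Spelled out, one loader session might look like the sketch below (the column and file names are illustrative; run one such session per core, each on its own chunk of the split input):

```sql
-- One loader session (repeat with loader2, loader3, ... in parallel sessions).
-- TEMPORARY tables are session-local and skip WAL, so the COPY avoids log I/O.
CREATE TEMPORARY TABLE loader1 (
    hostname text NOT NULL
);

-- Bulk-load one chunk of the split input file.
COPY loader1 FROM '/tmp/hosts_chunk1.txt';
```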

That's good advice, thank you :)

If your disk is fast, COPY is CPU-bound, so if you can run one COPY process per core and avoid writing WAL, it will scale.

This doesn't solve the other half of your problem (removing the duplicates) which isn't easy to parallelize, but it will make the COPY part a lot faster.

Note that you can have 1 core process the INSERT / removing duplicates while the others are handling COPY and filling temp tables, so if you pipeline it, you could save some time.
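One way to implement the single-process duplicate-removal step is an anti-join against the target table. This is a sketch, assuming a hypothetical target table `hosts` with a unique `hostname` column (on PostgreSQL 9.5 or later, `INSERT ... ON CONFLICT DO NOTHING` would be an alternative):

```sql
-- Insert only rows not already present in the target table.
-- DISTINCT also removes duplicates within the loaded chunk itself.
INSERT INTO hosts (hostname)
SELECT DISTINCT l.hostname
FROM loader1 l
LEFT JOIN hosts h ON h.hostname = l.hostname
WHERE h.hostname IS NULL;
```

Because only one process runs this INSERT while the others keep filling temp tables, there is no race between concurrent inserters on the unique constraint.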

Does your data contain a lot of duplicates, or are they rare ? What percentage ?

Within the data to import, most rows have 20 to 50 duplicates; sometimes many more, sometimes fewer.

But over 99.1% of the rows to import are already known. This percentage keeps growing, because there is only a finite number of rows I want to collect.

In my specific case I'm collecting domain names. So far it's completely for private interest, on ordinary PC hardware. I collect them by crawling known sites and checking them for new hosts. Maybe I will later build an expired-domain service or a reverse-IP database or something like that. But for now I'm just interested in the connections between sites and the structure of the domain names people choose.

(Before someone asks: so far I have more rows than there are (nearly) domains in existence, because I also collect subdomains of all levels and never delete entries.)

Thanks everyone for your advice. This will help me a lot!

Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer that abstracts working with 8 different database systems, separates queries from applications, and can automatically evaluate query results.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

