On Feb 6, 2008 9:05 AM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote:
On Tue, 5 Feb 2008, Simon Riggs wrote:
> On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:
>> Even if it is a single core, the mere fact that the loading process will
>> eventually wait for a read from the input file which cannot be
>> non-blocking, the OS can timeslice it well for the second process to use
>> those wait times for the index population work.
>
> If Dimitri is working on parallel load, why bother?

pgloader is a great tool for a lot of things, particularly if there's any
chance that some of your rows will get rejected. But the way things pass
through the Python/psycopg layer made it uncompetitive (more than 50%
slowdown) against the straight COPY path from a rows/second perspective
the last time (V2.1.0?) I did what I thought was a fair test of it (usual
caveat of "with the type of data I was loading"). Maybe there's been some
gigantic improvement since then, but it's hard to beat COPY when you've
got an API layer or two in the middle.
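
For reference, the gap Greg is describing is roughly the gap between pushing rows through the client API one statement at a time and handing the whole stream to COPY. A minimal sketch with psycopg2 (the table, columns, and file path are invented purely for illustration; run one path or the other, not both):

# Minimal sketch contrasting per-row INSERTs with the straight COPY path.
# Assumptions: psycopg2 is installed, a table items(id int, name text) exists,
# and /tmp/items.tsv is tab-separated. All names here are illustrative only.
import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

# Slow path: every row is a separate round trip through the client API and parser.
with open("/tmp/items.tsv") as f:
    for line in f:
        id_, name = line.rstrip("\n").split("\t")
        cur.execute("INSERT INTO items (id, name) VALUES (%s, %s)", (id_, name))

# Fast path: stream the file straight into the server-side COPY machinery.
with open("/tmp/items.tsv") as f:
    cur.copy_expert("COPY items (id, name) FROM STDIN", f)

conn.commit()
conn.close()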
I think it's time we jazzed COPY up a bit to include all the discussed functionality. Heikki's batch-indexing idea is pretty useful too.

Another thing that pg_bulkload does is load tuples directly into the relation: it constructs the tuples and writes them straight to the physical file corresponding to the relation, bypassing the engine completely (of course, the limitations that arise from that are no support for rules, triggers, constraints, default expression evaluation, etc.). ISTM we could optimize the COPY code to try direct loading too (not necessarily the way pg_bulkload does it) to speed it up further in certain cases.
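
Until COPY handles index maintenance in batches internally, the usual manual approximation of that idea is to drop the indexes, bulk-load, and rebuild them afterwards. A rough sketch under the same invented names as above (not pg_bulkload's or Heikki's actual mechanism, just the client-side workaround):

# Rough sketch of the "load first, index later" pattern (object names invented).
import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

cur.execute("DROP INDEX IF EXISTS items_name_idx")   # cheaper to rebuild once than to maintain per row
with open("/tmp/items.tsv") as f:
    cur.copy_expert("COPY items (id, name) FROM STDIN", f)
cur.execute("CREATE INDEX items_name_idx ON items (name)")
cur.execute("ANALYZE items")                          # refresh planner stats after the bulk load
conn.commit()
conn.close()

Whether dropping the indexes is actually a win depends on how big the existing table is relative to what you are loading, which is exactly why doing the batching inside COPY itself would be nicer.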
Another thing we should add to COPY is the ability to continue a data load across errors, as was discussed on -hackers some time back.
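
Until COPY grows that natively, a client-side approximation is to feed the data in batches under savepoints and, when a batch fails, replay it row by row so only the offending rows are rejected. A hedged sketch (helper names, the table, and the batch size are all arbitrary choices, not an existing tool's behaviour):

# Sketch of loading that continues past bad rows. Strategy: COPY each batch under a
# savepoint; if it fails, roll back and replay that batch one row at a time so only
# the offending rows end up in the reject list.
# Assumptions: psycopg2, a table items(id int, name text), tab-separated input.
import io
import psycopg2

BATCH_SIZE = 10000

def copy_rows(cur, rows):
    cur.copy_expert("COPY items (id, name) FROM STDIN",
                    io.StringIO("".join(rows)))

def load_batch(cur, rows, rejected):
    cur.execute("SAVEPOINT batch")
    try:
        copy_rows(cur, rows)
        cur.execute("RELEASE SAVEPOINT batch")
    except psycopg2.Error:
        cur.execute("ROLLBACK TO SAVEPOINT batch")
        for row in rows:
            cur.execute("SAVEPOINT one_row")
            try:
                copy_rows(cur, [row])
                cur.execute("RELEASE SAVEPOINT one_row")
            except psycopg2.Error:
                cur.execute("ROLLBACK TO SAVEPOINT one_row")
                rejected.append(row)        # keep going; log or re-queue the bad row later

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()
rejected = []
batch = []
with open("/tmp/items.tsv") as f:
    for line in f:
        batch.append(line)
        if len(batch) >= BATCH_SIZE:
            load_batch(cur, batch, rejected)
            batch = []
if batch:
    load_batch(cur, batch, rejected)
conn.commit()
conn.close()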
Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com