On Wed, 6 Feb 2008, Dimitri Fontaine wrote:
In fact, the -F option works by having pgloader read the given number of lines but skip processing them, which I think is not at all what Greg is talking about here.
Yeah, that's not useful.
Greg, what would you think of a pgloader that splits the file reading based on the file size as given by stat (os.stat(file)[ST_SIZE]) and the number of threads: we would split the file into as many pieces as the section's section_threads config value.
Now you're talking. Find a couple of split points that way, fine-tune the boundaries a bit so they rest on line termination points, and off you go. Don't forget that the basic principle here implies you'll never know until you're done just how many lines were really in the file. When thread#1 is running against chunk#1, it will never have any idea what line chunk#2 really started at until it reaches there, at which point it's done and that information isn't helpful anymore.
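A minimal sketch of that boundary-finding step, in Python since that is what pgloader is written in. The function name chunk_boundaries and its nchunks parameter are made up for illustration (nchunks stands in for the section_threads value); this is not pgloader code, just one way the idea could look:

```python
import os
from stat import ST_SIZE


def chunk_boundaries(path, nchunks):
    """Return a list of (start, end) byte ranges covering the whole file.

    Every range except possibly the last ends just after a newline, so
    each range holds only complete lines. No line counting is done here.
    """
    size = os.stat(path)[ST_SIZE]
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, nchunks):
            # Rough split point from the file size alone.
            guess = size * i // nchunks
            if guess <= offsets[-1]:
                continue
            # Fine-tune: move forward to the next line termination point.
            f.seek(guess)
            f.readline()
            pos = f.tell()
            if offsets[-1] < pos < size:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))
```

The only reads are one short readline() per split point, so the cost does not grow with the file size. A file with a few very long lines may come back with fewer ranges than requested.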
You have to stop thinking in terms of lines for this splitting; all you can do is split the file into useful byte sections and then count the lines within them as you go. Anything else requires a counting scan of the file, and such a sequential read is exactly what can't happen (especially not more than once); it just takes too long.
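To illustrate counting as you go, here is a rough Python sketch where each worker is handed a byte range and reports its own line count. The names count_lines_in_chunk and count_all are hypothetical, the byte ranges are assumed to already sit on line boundaries, and the place where a real loader would process the row is only a comment:

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines_in_chunk(path, start, end):
    """Read the lines in bytes [start, end) and return how many were seen."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break  # hit end of file early
            # A real loader would parse and load the row here.
            count += 1
    return count


def count_all(path, chunks):
    """Run one worker per (start, end) range; return per-chunk counts and total."""
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        counts = list(pool.map(lambda c: count_lines_in_chunk(path, *c), chunks))
    # Only now, with every chunk finished, is the file's line total known.
    return counts, sum(counts)
```

No worker knows the absolute line number its chunk starts at; the per-chunk counts only add up to a file-wide total after everything has finished.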
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD