Re: Benchmark Data requested --- pgloader CE design ideas

Greg Smith <gsmith@xxxxxxxxxxxxx> · Wed, 6 Feb 2008 18:36:13 -0500 (EST)

On Wed, 6 Feb 2008, Dimitri Fontaine wrote:

> In fact, the -F option works by having pgloader read the given number of lines
> but skip processing them, which is not at all what Greg is talking about here
> I think.

Yeah, that's not useful.

> Greg, what would you think of a pgloader which will separate file reading
> based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
> threads: we split into as many pieces as section_threads section config
> value.

Now you're talking.  Find a couple of split points that way, fine-tune the 
boundaries a bit so they rest on line termination points, and off you go. 
Don't forget that the basic principle here implies you'll never know until 
you're done just how many lines were really in the file.  When thread #1 is
running against chunk #1, it has no idea what line number chunk #2
really starts at until it gets there, at which point it's done and
that information isn't helpful anymore.
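
To make the boundary fine-tuning concrete, here is a rough Python sketch of
what I mean.  It is not pgloader code: chunk_boundaries and its arguments
are names I made up for illustration, and it assumes every record ends in a
plain "\n" with no newlines embedded inside quoted fields.

```python
import os

def chunk_boundaries(path, n_chunks):
    """Return (start, end) byte ranges that each begin at the start of a line.

    Hypothetical helper for illustration only. Assumes records are
    terminated by b"\\n" and never contain an embedded newline.
    """
    size = os.stat(path).st_size
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            guess = size * i // n_chunks      # naive split point from file size
            if guess <= offsets[-1]:
                continue                      # file too small for this many chunks
            # Step back one byte: if the guess already sits right after a
            # newline, readline() consumes just that newline and we keep it.
            f.seek(guess - 1)
            f.readline()                      # advance to the next line start
            pos = f.tell()
            if pos >= size:
                break                         # ran off the end, no more splits
            if pos > offsets[-1]:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))
```

Each split costs one seek and one short read, so the file is never scanned
just to find the boundaries.  You may get fewer chunks back than you asked
for if the file is small or has very long lines.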

You have to stop thinking in terms of lines for this splitting; all you 
can do is split the file into useful byte sections and then count the 
lines within them as you go.  Anything else requires a counting scan of
the file, and such a sequential read is exactly what can't happen
(especially not more than once); it just takes too long.
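
Counting as you go could look something like the sketch below, again in
Python and again with made-up names (process_chunk, count_lines_by_chunk,
first_line_numbers).  It assumes the (start, end) byte ranges were already
aligned to line starts, and the place where a real loader would parse and
load the row is just a comment.

```python
import threading

def process_chunk(path, start, end, counts, idx):
    """Read the lines in [start, end) and record how many there were."""
    n = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break                # unexpected end of file
            n += 1                   # a real loader would parse/load the row here
    counts[idx] = n

def count_lines_by_chunk(path, chunks):
    """Run one thread per byte range; return the per-chunk line counts."""
    counts = [0] * len(chunks)
    threads = [
        threading.Thread(target=process_chunk, args=(path, s, e, counts, i))
        for i, (s, e) in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts

def first_line_numbers(counts):
    """Turn per-chunk counts into the 1-based line number each chunk starts at."""
    starts, total = [], 1
    for n in counts:
        starts.append(total)
        total += n
    return starts
```

Note that first_line_numbers can only be called once every thread has
finished, which is the whole point: absolute line numbers are something you
learn at the end, not something you can use to drive the split.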

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD
