On Fri, 3 Apr 2009 02:05:19 +0100 Sam Mason <sam@xxxxxxxxxxxxx> wrote:

> On Thu, Apr 02, 2009 at 09:48:33PM +0200, Ivan Sergio Borgonovo
> wrote:
> > I didn't find any elegant example of cursor use in PHP... OK, PHP
> > is not the most elegant language around... but still, is there
> > any good example someone could point me at?

> You can obviously increase the "FETCH" upwards and if you're
> feeling fancy you could even run the FETCH async from the code
> that processes the results. Maybe something like:

What kind of advantage should I get from running it asynchronously?

Oh, I didn't mean you were suggesting any advantage... just wondering.
It could be an option, once everything is up, if I want to keep the
resources this process eats under control (?). I've appended a sketch
of the async FETCH loop, as I understand it, at the end of this mail.

> > So I think the largest cost of the operation will be IO.
> > \copy should be optimised for "raw" data output, but maybe all
> > its advantages get lost once I have to use pipes and add
> > complexity to the filtering.

> Streaming IO is pretty fast, I think you'll be hard pushed to keep
> up with it from PHP and you'll end up CPU bound in no time. Be
> interesting to find out though.

The filtering is currently very simple: I'm building a very plain xml,
just queueing constant strings and whatever comes out of the DB. But
if I had to parse CSV (split), or just assign names to columns (and I
expect this stuff to be adjusted frequently), or use regexps... I was
expecting to waste more human cycles, or CPU cycles, than I would gain
from the optimised IO of \copy (if any). There is a sketch of the
piped \copy approach at the end of this mail too.

Most of the operations end up being:

$output = SOMEHEADER;
gzwrite($f1, $output); gzwrite($f2, $output);
gzwrite($f3, $output); gzwrite($f4, $output);
while (/* more rows */) {
    $record1 = '<sometag someattr>' . SOMECONST1 . $row['col4'] . '</sometag>'
        . '<someother someattr>' . $row['col3'] . '</someother>';
    $record2 = '<sometag someattr>' . SOMECONST2 . $row['col4'] . '</sometag>'
        . '<someother someattr>' . $row['col3'] . '</someother>';
    gzwrite($f1, $record1); gzwrite($f2, $record1);
    gzwrite($f3, $record2); gzwrite($f4, $record2);
}
$output = SOMEFOOTER;
gzwrite($f1, $output); gzwrite($f2, $output);
gzwrite($f3, $output); gzwrite($f4, $output);

I have to slice the largest table of my DB into multiple xml files
that have to be written to disk. So it is actually:

1) reading the whole table and returning 30% of its fields;
2) writing all that data multiple times.

This is by far the largest write load the server is going to incur in
a day, and it may well be the largest CPU load too, considering I have
to gzip all the files. Still, I think I've read on this list that
compression was going to be more of a bottleneck than IO. I just did a
preliminary test: xml-ing and gzipping 80K records out of 1M takes
less than 2 sec, so maybe I was overly concerned. (If it ever matters,
the gzip compression level is an easy knob; see the last sketch
below.)

Anyway, I'd like to understand a bit better how IO and memory
consumption are managed with a cursor vs. a plain select, with the
client drivers in the middle. The cursor loop I have in mind is the
first sketch at the end of this mail.

> > I was reading all the php documents and trying to understand how
> > buffers and memory usage work, so I had a look at the MySQL
> > documents too...

> Not sure about PG, but the C api pretty much always buffers
> everything in memory first. There was mention of getting control
> of this, but I've got no idea where it got to.

Buffer *everything*?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
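
PS: for the archives, this is the kind of cursor loop I have in mind.
A minimal, untested sketch: it assumes a pgsql connection in $conn, a
hypothetical table items(col3, col4), and a batch size of 1000.

<?php
// Cursors only live inside a transaction; each FETCH pulls one batch,
// so PHP only ever buffers 1000 rows instead of the whole table.
$conn = pg_connect('dbname=mydb');   // hypothetical connection string
pg_query($conn, 'BEGIN');
pg_query($conn, 'DECLARE cur NO SCROLL CURSOR FOR SELECT col3, col4 FROM items');
while (true) {
    $res = pg_query($conn, 'FETCH FORWARD 1000 FROM cur');
    if (pg_num_rows($res) === 0) {
        break;                       // cursor exhausted
    }
    while ($row = pg_fetch_assoc($res)) {
        // build the xml fragments and gzwrite() them here
    }
    pg_free_result($res);
}
pg_query($conn, 'CLOSE cur');
pg_query($conn, 'COMMIT');

As far as I understand, pg_query() still buffers each FETCH result in
full, but that is only one batch, so memory stays bounded by the batch
size rather than by the table.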
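
And the async variant Sam mentioned, as I understand it: overlap the
server producing the next batch with PHP formatting the current one. A
rough sketch only (untested; same $conn and cursor as above, and I'm
not sure I have the result-draining dance exactly right):

<?php
// ask for the first batch without waiting for it
pg_send_query($conn, 'FETCH FORWARD 1000 FROM cur');
while (true) {
    $res = pg_get_result($conn);   // blocks until the pending batch arrives
    pg_get_result($conn);          // drain the queue so a new query can be sent
    if ($res === false || pg_num_rows($res) === 0) {
        break;
    }
    // request the next batch; the server works on it while we format this one
    pg_send_query($conn, 'FETCH FORWARD 1000 FROM cur');
    while ($row = pg_fetch_assoc($res)) {
        // build xml / gzwrite() here
    }
    pg_free_result($res);
}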
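
By "pipes" for \copy I meant something like reading COPY output from a
psql child process. A sketch, assuming psql can log in to the
hypothetical database mydb without a password prompt:

<?php
// stream COPY output straight from psql; filtering happens line by line
$cmd = 'psql -c "COPY (SELECT col3, col4 FROM items) TO STDOUT" mydb';
$pipe = popen($cmd, 'r');
while (($line = fgets($pipe)) !== false) {
    // COPY text format is tab separated; note it also backslash-escapes
    // special characters and uses \N for NULL, which real code must handle
    $cols = explode("\t", rtrim($line, "\n"));
    // build xml from $cols and gzwrite() it
}
pclose($pipe);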
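
Finally, on compression being the bottleneck: if it ever turns out to
be, the compression level is a cheap knob, since gzopen() accepts it
in the mode string:

<?php
// trade file size for CPU: level 1 is much cheaper than the default
$f1 = gzopen('/tmp/feed1.xml.gz', 'wb1');   // 'wb9' = smallest, 'wb1' = fastest
gzwrite($f1, '<root>...</root>');
gzclose($f1);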