Re: mod-cgi reads entire output into memory...

"Nadav Har'El" <nyh@xxxxxxxxxxxxxxxxxxx> · Tue, 17 Jul 2012 07:50:53 +0300

On Mon, Jul 16, 2012, Nick Kew wrote about "Re:  mod-cgi reads entire output into memory...":
> On Mon, 16 Jul 2012 17:07:23 +0300
> Nadav Har'El <nyh@xxxxxxxxxxxxxxxxxxx> wrote:
> > I looked at the httpd code, discovered (if I understand correctly) that
> > 1. As I already guessed, Apache doesn't let the CGI write directly to the
> > socket, but rather asks it to write to a pipe, which Apache then reads.
> 
> Yep.  That's what CGI is all about.

:-)

I've set out to write a simple mod_cgi replacement which lets the child
process write its output directly to the client socket. It behaves like
NPH (Apache gets no chance to fix the CGI's headers or to filter its
output), but I think it will be useful in a lot of cases (who filters
CGI output anyway?), and it is certainly more efficient in mine. In fact,
I wonder why NPH scripts don't always work like that.

I "almost" have such code, but ran into a mystery: where in the
request_rec can I find the client socket, so I can write to it directly?
There are so many layers of output filters, APR, etc., that I can't seem
to find this simple thing...

> > 2. When Apache reads this data from the pipe, it doesn't write it directly
> > but rather just adds it to a "bucket brigade" which collects more and
> > more data.
> 
> No, it doesn't collect more and more data, unless some filter needs to
> buffer the entire output.  Normally it passes data down the chain.
> Each filter's job is to process a chunk of data then pass it to the next.

In my tests, all the data was definitely being collected, and I was not
using any output filter (at least none that I know of): no deflate or
anything of that sort.

I'm no longer sure about my original statement that the buffering happens
when the client reads the output slowly. In fact, it now looks to me
as if the extreme memory use happens when the client reads very quickly
(e.g., when the client connects through localhost). I haven't got a
clue why this is happening. I don't suppose Apache has any time-based
bucket-brigade flow control or memory pool reuse algorithms?

> > I confirmed that this is indeed a flow-control problem by changing the
> > CGI to sleep for 1 second after outputting each 64 MB (i.e., 8 batches
> > of 64 MB output); Now, the memory usage was around 64 MB, not 512 MB,
> > because Apache had the time to output each batch and free its memory
> > before the next batch came.
> 
> Sounds like the entire contents of the pipe got read into memory in
> a single read!  Not good, but not as bad as you think.

Is this actually possible? Doesn't Apache allocate a relatively small
buffer and read into that? How can it read 512 MB in a single read?
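
To make concrete what I would have expected, here is a minimal sketch of
a bounded-memory pipe-to-socket copy loop. It is NOT Apache's actual code:
the function name copy_bounded and the 8 KB buffer size are my own
assumptions, chosen only for illustration. Memory use stays at one buffer
no matter how much the CGI writes.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper, not an Apache function: copy everything from in_fd
 * to out_fd through one fixed 8 KB buffer.  Returns the number of bytes
 * copied, or -1 on error. */
long long copy_bounded(int in_fd, int out_fd)
{
    char buf[8192];              /* the only buffer, whatever the input size */
    long long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;        /* EOF: the writer closed its end */
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal, retry the read */
            return -1;
        }

        /* write() may accept only part of the data, so loop until the
         * whole chunk is out before reading the next one. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;    /* interrupted by a signal, retry the write */
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

With blocking descriptors, a slow reader on out_fd stalls write(), the
loop stops calling read(), the pipe fills up, and the CGI itself blocks.
That is the kind of flow control I assumed was in place.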

> Sleeping is a drastic workaround.  What happens if you just flush your
> CGI output every 8Mb (or, preferably, in smaller chunks than that)?

The CGI is a trivial one written in C, using stdio and puts()'ing 8192
strings of 65536 bytes each. I don't think that stdio buffers 512 MB
or anything close to that. Also, as I said, the CGI program itself does
NOT grow in memory use - only Apache does.
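
For anyone who wants to reproduce this, here is a rough sketch of the
output loop described above. It is a reconstruction, not the exact
program: the function name emit_lines and its parameters are made up for
this example, and it uses fputs() rather than puts() so the output stream
can be passed in. The flush_each flag is there only to try the
flush-per-chunk suggestion.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reconstruction of the test CGI's output loop: write `count`
 * lines of `line_len` bytes each (trailing newline included) to `out`,
 * calling fflush() after every line when flush_each is nonzero.
 * Returns the number of bytes written, or -1 on error. */
long long emit_lines(FILE *out, size_t count, size_t line_len, int flush_each)
{
    if (line_len == 0)
        return -1;               /* need room for at least the newline */

    char *line = malloc(line_len + 1);
    if (line == NULL)
        return -1;
    memset(line, 'x', line_len - 1);
    line[line_len - 1] = '\n';
    line[line_len] = '\0';

    long long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (fputs(line, out) == EOF ||
            (flush_each && fflush(out) == EOF)) {
            free(line);
            return -1;
        }
        total += (long long)line_len;
    }

    free(line);
    return total;
}
```

A main() that prints "Content-Type: text/plain" followed by an empty line
and then calls emit_lines(stdout, 8192, 65536, 0) sends
8192 * 65536 = 536,870,912 bytes, i.e. the 512 MB from my test, while the
process itself holds only one 64 KB line plus the stdio buffer.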

> You might want to look at the mod_proxy framework as an alternative harness
> to run your program.

Interesting idea. I'll take a look at that.

Thanks,
Nadav.

-- 
Nadav Har'El                        |     Tuesday, Jul 17 2012, 27 Tammuz 5772
nyh@xxxxxxxxxxxxxxxxxxx             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Despite the cost of living, have you
http://nadav.harel.org.il           |noticed how it remains so popular?
