On Mon, Jan 25, 2010 at 14:56 +0100, Petr Baudis wrote:
> On Mon, Jan 25, 2010 at 02:48:26PM +0100, Jakub Narebski wrote:
> > Now those patches (mine and J.H.'s both) make gitweb use locking
> > (it is IIRC configurable in J.H.'s patch) so that only one process
> > generates the page if it is missing from cache, or is stale. Now
> > if it is missing, we have to wait until it is generated in full
> > before being able to show it to the client. While it is possible to
> > "tee" output (using PerlIO::tee, or Capture::Tiny, or tie like
> > CGI::Cache does), writing it simultaneously to the browser and to
> > the cache in the process that is generating the data, as far as I
> > understand it this is impossible for processes which are waiting
> > for the data. Hence the need for the "Generating..." page, so the
> > user does not think that the web server hung or something, and is
> > not generating output.
>
> Ah, ok, so the message is there to cover up for a technical problem. ;-)
> I didn't quite realize. Then, it would be great to tweak the mechanisms
> so that the user does not really have to wait.

Well, the mechanism would certainly be configurable in the final version
(the current split version is more of a proof of concept of splitting).

> So, I wonder about two things:
>
> (i) How often does it happen that two requests for the same page are
> received? Has anyone measured it? Or is at least able to make
> a minimally educated guess? IOW, isn't this premature optimization?

To be more exact, the question is how often a second request for the same
page arrives before the earlier request has finished processing. It is a
matter of both the frequency of a given request and the time it takes to
generate the response (which grows with growing load on the server).

As to measurements: Pasky, do you have access logs, or their analysis a la
AWStats, Webalizer and the like, for repo.or.cz? Warthog9, do you have
access logs or analysis for git.kernel.org? Can you get similar data from
fedorahosted?
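To make the "tee" idea concrete, here is a minimal sketch in Python
(gitweb itself is Perl, where PerlIO::tee or a tied filehandle would play
this role). The `Tee` class and stream names are illustrative, not
gitweb's actual API: the generating process writes each chunk both to the
client and to the cache entry being built, so the first requester does not
have to wait for the full page.

```python
import io

class Tee:
    """File-like writer that duplicates every write to several streams
    (here: the HTTP response and the cache entry under construction)."""
    def __init__(self, *streams):
        self.streams = streams

    def write(self, data):
        for s in self.streams:
            s.write(data)
        return len(data)

    def flush(self):
        for s in self.streams:
            s.flush()

client = io.StringIO()       # stands in for the HTTP response stream
cache_entry = io.StringIO()  # stands in for the cache file being written

out = Tee(client, cache_entry)
out.write("<html>page body</html>")
out.flush()

assert client.getvalue() == cache_entry.getvalue()
```

As the quoted text notes, this only helps the process that generates the
data; other processes waiting on the lock still see nothing until the
entry is complete, which is what motivates the "tail -f" idea below.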
> (ii) Can't the locked gitwebs do the equivalent of tail -f?

Well, they could, in principle, but it would need some changes.

First, instead of using a temporary file to create the cache entry
atomically (write to a temporary file, then rename), the process
generating the data would have to write to a file that other processes
can read from. It could be e.g. the lockfile.

Second, the cache API would need to be extended so that generated data is
streamed to the cache file:

  ->set($key, $data)  ==>  ->set($key, $fh)  or  ->set_io($key, $fh)

This would mean some complications, but what might be more important is
that, as far as I can see, this trick would not work with caching
backends / caching engines other than the one from gitweb/cache.pm
(like memcached-based or mmap-based ones).

Then the code could look like the following (in pseudocode):

  try to acquire writers lock
  if (acquired writers lock) {
      generate and "tee" response
      create cache entry
  } else {
      # <<<<<<
      while (not acquired writers lock && sysread something) {
          print <data>;
      }
      # >>>>>>
      retrieve and print (rest of) data
  }

where the parts between <<<<<< and >>>>>> are new.

But there is another complication: gitweb needs to be able to deal with
the situation where the process generating the data got interrupted
before producing full output, or where it ran die_error, which does not
generate any cache entry (e.g. if the URL we are trying to access returns
"404 Not Found" - the check for existence of an object can take a while
if the system is busy, I think).

In the current implementation, either the cache entry is written in full,
or it is not written at all. It would be, I think, fairly easy with the
current code to check whether the cache entry got generated once we
acquire the readers lock (when the generating process gets terminated the
lock is released, which is an advantage over locking via atomic file
creation with O_EXCL), and if it didn't, to repeat the whole process.
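The new branch of the pseudocode above can be sketched as follows, again
in Python rather than gitweb's Perl. This is a simplified model, with
illustrative names: a waiting process streams data out of the partially
written cache file as it grows, and stops once the writer signals
completion - here, for simplicity, by deleting a lockfile (the real code
would hold and test an flock-style writers lock instead).

```python
import os
import tempfile
import threading
import time

def tail_until_unlocked(path, lockpath, sink, poll=0.01):
    """Copy data from `path` into `sink` as it appears, until the
    writer is done (lockfile gone) and the file has been drained."""
    with open(path, "r") as fh:
        while True:
            chunk = fh.read()
            if chunk:
                sink.append(chunk)
            elif not os.path.exists(lockpath):
                return  # writer finished and nothing is left to read
            else:
                time.sleep(poll)  # wait for more data to arrive

tmpdir = tempfile.mkdtemp()
data = os.path.join(tmpdir, "cache.entry")
lock = os.path.join(tmpdir, "cache.lock")
open(data, "w").close()  # empty, partially-written cache entry
open(lock, "w").close()  # "writers lock" held while generating

def writer():
    with open(data, "a") as fh:
        for part in ("header\n", "body\n", "footer\n"):
            fh.write(part)
            fh.flush()
            time.sleep(0.02)  # simulate slow page generation
    os.remove(lock)  # releasing the lock marks the entry complete

out = []
t = threading.Thread(target=writer)
t.start()
tail_until_unlocked(data, lock, out)
t.join()
assert "".join(out) == "header\nbody\nfooter\n"
```

Note how this sketch also exhibits the complication discussed above: if
the writer dies without removing the lockfile, readers would spin, so the
real implementation needs the lock to be released on process termination
and a way to tell a complete entry from an interrupted one.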
With the "tee"/"tail" solution, if the process generating the data got
interrupted before finishing, we can detect that situation, but currently
I have no idea what should be done about it. We can detect the die_error
situation as easily as with the current solution (which needs the
"Generating..." page as an activity indicator), and with some care, i.e.
by not writing to the cache file directly, we can ensure that cache
entries contain full, correctly generated data.

> P.S.: Again the disclaimer - if this is "too hard", it's better to
> accept patches like they are, then improve this later. But perhaps
> a better solution would be not to clutter the code by optimizing this
> case at all if it's not clear it really matters in the real world.

See above.

P.S. I have noticed that with the current implementation (well, I am not
sure if this is also true of J.H.'s implementation) there is a problem if
more than one process requests a URL which results in die_error being
called. The design decision, present in the original patch, was to not
cache "die_error" / non-"200 OK" pages; it seems sane, but I don't know
if it was the correct decision. The solution for an interrupted
generating process, described above, also works for die_error pages,
although it makes die_error pages slower in such a (hopefully rare)
situation of simultaneous erroneous requests.

P.P.S. Both Pasky's approach to caching the projects_list page, and Lea
Wiemann's work on the "gitweb caching" project for Google Summer of Code
2008, approached caching in a different way: by caching (parsed) data,
not by caching output. Note however that for some actions, like
'snapshot', we would probably want response/output caching anyway. Also,
for output caching we can use the X-Sendfile (or similar) extension.

-- 
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html