[RFD] Possible improvements for output caching in gitweb

Jakub Narebski <jnareb@xxxxxxxxx> · Sun, 10 Oct 2010 22:32:07 +0200

On Thu, 7 Oct 2010, Jakub Narebski wrote:

> TODO list and areas of possible improvements would be send in separate
> email.

Here they are.  What do you think about them: which are needed, which
ones would be nice to have, which are not worth the trouble, and,
perhaps most important, which ones are missing?  Also, in what order
of importance should they be worked on?

Note that the list below deliberately omits improvements to the code
itself (which were already commented on; thanks again).

New features and improvements related directly to cache or capture:

* Ajax-y progress indicator (perhaps inside skeleton of page)
  see 9982b6f (gitweb: Ajax-y "Generating..." page when regenerating
  cache (WIP), 2010-01-24) on 'gitweb/cache-kernel' branch in the
  http://repo.or.cz/w/git/jnareb-git.git repository.

  Instead of relying on http-equiv refresh trick (which uses the fact
  that web browsers render an incomplete page, and that the refresh is done
  only after page is received in full), use XMLHttpRequest to get
  (re)generated version of the page, displaying progress info while at
  it, and redraw page when data is received in full.  All of this only
  when JavaScript is enabled, so I guess old trick should be kept as 
  fallback.

  This of course assumes that progress info indicator is important...

* error handler, like in CHI

  Instead of using 'die <message>' and relying on CGI.pm and gitweb
  catching the exception and displaying it (set_message from CGI.pm),
  pass error handler (wrapped die_error) to cache constructor.  In the
  case of CHI caching framework, there are 'on_get_error' and
  'on_set_error' options.

  In the original patches by J.H. subroutines from cache.pm used
  die_error directly; this was possible only because the file was
  loaded using "do 'cache.pm';" as a kind of mixin / role into gitweb
  code.
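
  A rough sketch of what passing an error handler to the constructor
  could look like.  The class name My::SketchCache and the single
  'on_error' option are made up for illustration (CHI itself has
  separate 'on_get_error' and 'on_set_error'); in gitweb the handler
  would wrap die_error.

```perl
use strict;
use warnings;

# Hypothetical minimal cache class; 'on_error' is an assumed option name.
package My::SketchCache;

sub new {
    my ($class, %opts) = @_;
    my $self = {
        # default handler keeps the current behaviour: just die
        on_error => $opts{on_error} || sub { die @_ },
        store    => {},
    };
    return bless $self, $class;
}

sub set {
    my ($self, $key, $value) = @_;
    # an undefined value stands in for a real failure (e.g. I/O error)
    return $self->{on_error}->("cannot store undef for key '$key'")
        unless defined $value;
    $self->{store}{$key} = $value;
    return $value;
}

package main;

# In gitweb this coderef would call die_error(); here it only records
# the message so that the sketch runs standalone.
my @errors;
my $cache = My::SketchCache->new(
    on_error => sub { push @errors, "500 Internal Server Error: $_[0]"; return },
);
$cache->set('summary', undef);
print "$errors[0]\n";
```

  The cache module then never needs to know about die_error, so it can
  be loaded with a plain 'use' instead of "do 'cache.pm';".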

* $capture option to cache_output()

  Currently in GitwebCache::CacheOutput the capture engine used to
  capture gitweb output for caching is hardcoded; mind you, one would
  need to change code in only two places to use a different compatible
  capture engine, but it would still require changing code.  It would
  be better to pass $capture as a parameter to cache_output(), just
  like $cache is.
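
  A sketch of the idea, not the actual gitweb code: the argument order
  of cache_output() below, the My::SketchCapture class and the use of a
  plain hash as $cache are all assumptions made to keep the example
  self-contained.

```perl
use strict;
use warnings;

# Hypothetical capture engine: collects the output of bare print calls
# (those without an explicit filehandle) into an in-memory buffer.
package My::SketchCapture;

sub new { return bless {}, shift }

sub capture {
    my ($self, $code) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory file: $!";
    my $old = select($fh);       # make $fh the default output handle
    eval { $code->() };
    my $err = $@;
    select($old);                # always restore the previous handle
    close $fh;
    die $err if $err;
    return $buffer;
}

package main;

# Assumed signature: cache_output($cache, $capture, $key, $code).
sub cache_output {
    my ($cache, $capture, $key, $code) = @_;
    $cache->{$key} = $capture->capture($code)
        unless exists $cache->{$key};
    print $cache->{$key};
    return;
}
```

  Any object with a compatible capture() method could then be swapped
  in by the caller without touching cache_output() itself.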

* POD documentation instead of comments + make doc

  Currently gitweb caching modules are documented (like the original
  cache.pm by J.H.) only in comments, and a bit in gitweb/README.
  Though those modules are used only internally, it might be better
  to use POD (perlpod) for documentation, like in Git.pm.

  We might also want to add a 'doc' target in gitweb/Makefile, though
  it might be difficult (see perl/Makefile.PL and generated perl.mak).

* cache expire variation a la CHI

  CHI (caching interface in Perl) supports 'expires_variance' parameter,
  which according to documentation:

   "Controls the variable expiration feature, which allows items to
    expire a little earlier than the stated expiration time to help
    prevent cache miss stampedes.

    Value is between 0.0 and 1.0, with 0.0 meaning that items expire
    exactly when specified (feature is disabled), and 1.0 meaning that
    items might expire anytime from now til the stated expiration time.
    The default is 0.0. A setting of 0.10 to 0.25 would introduce a
    small amount of variation without interfering too much with intended
    expiration times."

  See http://p3rl.org/CHI (or CHI manpage, if you have it installed).
  This feature is about *avoiding* cache miss stampede, while locking
  is used to ensure that only one process is regenerating cache for
  a given entry.
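
  To make the idea concrete, here is a small sketch written in the
  spirit of the description quoted above; it is not CHI's code.  The
  function name, its arguments and the linear growth of the expiration
  probability are my assumptions; $rand is a parameter only so that the
  behaviour can be checked deterministically.

```perl
use strict;
use warnings;

# Returns 1 if an entry should be treated as expired at time $now.
# $variance is in [0, 1]; $rand defaults to a fresh random number.
sub is_expired {
    my ($created_at, $expires_at, $variance, $now, $rand) = @_;
    $rand = rand() unless defined $rand;

    return 1 if $now >= $expires_at;
    # earliest moment at which early expiration may happen
    my $early_at = $expires_at - $variance * ($expires_at - $created_at);
    return 0 if $now < $early_at;
    # probability grows linearly from 0 at $early_at to 1 at $expires_at
    my $chance = ($now - $early_at) / ($expires_at - $early_at);
    return $rand < $chance ? 1 : 0;
}
```

  With a variance of 0.0 the early window is empty, so entries expire
  exactly at $expires_at, matching the "feature is disabled" case.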

* benchmarks for different caches under light and under heavy load;
  profiling of gitweb with caching using Devel::NYTProf.

  The problem is to both prepare repositories, and to generate traffic
  (or generate IO pressure) to represent real-life situation, where
  supposedly gitweb is IO bound, rather than CPU bound.

-------------------------------------------------------------
Below are cache-related improvements that require
GitwebCache::CacheOutput to be aware that it caches an HTTP response,
which consists of HTTP headers (text) separated by an empty line
from the body of the response (which can be binary).

This can be done either by parsing response or a retrieved cache entry, 
or by storing headers and body separately, or by using some Perl data 
structure (like for example the one used by PSGI) and storing it 
serialized (though serialization can affect performance).
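
A minimal sketch of the first option (parsing a captured response),
assuming the response is held in a single scalar.  The function name
is hypothetical, and repeated headers and folded (continued) header
lines are deliberately not handled.

```perl
use strict;
use warnings;

# Splits a captured CGI response into a header hash and the raw body.
# Header names are lowercased; the body is returned untouched.
sub split_response {
    my ($response) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $response, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    my %headers;
    for my $line (split /\r?\n/, $head) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $headers{lc $name} = $value;
    }
    return (\%headers, $body);
}
```

The limit of 2 in the first split matters: blank lines inside the body
must not be mistaken for the header/body separator.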

* X-Sendfile (or equivalent) support

  Web server encountering such HTTP header will discard all output and
  send the file specified by that header instead using web server
  internals including all optimizations like caching-headers and
  sendfile or mmap if configured.  For Apache it requires mod_xsendfile
  module (https://tn123.org/mod_xsendfile/), lighttpd has it built in
  (at least for FastCGI) but disabled by default; in Nginx similar
  feature is called X-Accel-Redirect.

  The idea is to use the cache file as the X-Sendfile contents; though
  this might require storing headers and body of the response
  separately, and might not be much of a speedup.
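
  A sketch of what serving such an entry could look like, assuming the
  headers are available as a hash and the body is stored in its own
  file.  The function name and the example path in the comment are
  hypothetical; note also that Nginx's X-Accel-Redirect expects an
  internal URI rather than a filesystem path.

```perl
use strict;
use warnings;

# Builds a header-only response that asks the web server to send
# $body_path itself.  $headers is a hashref of the cached headers.
sub sendfile_response {
    my ($headers, $body_path, $header_name) = @_;
    $header_name = 'X-Sendfile' unless defined $header_name;
    my $out = '';
    for my $name (sort keys %$headers) {
        $out .= "$name: $headers->{$name}\r\n";
    }
    # e.g. $body_path = '/var/cache/gitweb/ab/cdef.body' (made-up path)
    $out .= "$header_name: $body_path\r\n\r\n";   # no body follows
    return $out;
}
```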

* compressed cache entries (transfer-encoding) (?)

  To reduce size taken by cache, and also reduce bandwidth taken by
  serving gitweb requests, save body of response compressed.  Then,
  if the browser supports it, send compressed data with the HTTP header
  'Transfer-Encoding:' set to the appropriate value.

  The complication which, I think, we have to take into account is
  that some (hopefully a small number of) web browsers and net
  downloaders don't support the transfer-encoding we plan to use (gzip
  or deflate).  Also, gitweb should not compress files which it knows
  do not compress well, like already compressed snapshots (zip, tar.gz,
  tar.bz2) or images.

  There was a patch "gitweb: Enable transparent compression for HTTP
  output" sent to the git mailing list (using PerlIO::gzip), which
  compresses on every request; in the cached case we would pay the CPU
  cost only *once*.
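
  A sketch of "compress once, decompress only for clients that need
  it".  It uses IO::Compress::Gzip rather than PerlIO::gzip, which is
  my substitution (it is in core only since Perl 5.10).  The function
  names are made up, and $accepted stands for whatever request header
  lists the encodings the client supports; the check ignores q-values.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Compress once, when the entry is stored in the cache.
sub compress_body {
    my ($body) = @_;
    gzip(\$body => \my $packed) or die "gzip failed: $GzipError";
    return $packed;
}

# Pick what to send: the stored gzip data if the client accepts it,
# otherwise the uncompressed body.  Returns (encoding-or-undef, data).
sub body_for_client {
    my ($packed, $accepted) = @_;
    return ('gzip', $packed)
        if defined $accepted && $accepted =~ /\bgzip\b/;
    gunzip(\$packed => \my $plain) or die "gunzip failed: $GunzipError";
    return (undef, $plain);
}
```

  Entries that are already compressed (snapshots, images) would simply
  skip compress_body() and be stored as they are.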

* Replace text/html with application/xhtml+xml in header
  when reading from cache.

  In the non-cached case, gitweb serves the page using either the plain
  'text/html' content type or, if the web browser accepts it, the more
  advanced 'application/xhtml+xml' content type.  When caching is
  enabled, we have to always use 'text/html', because a web browser
  (e.g. lynx) might not accept the other... but with the cache being
  HTTP-aware, we can replace 'text/html' with 'application/xhtml+xml'
  in the 'Content-Type:' HTTP header when the browser accepts it.
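
  A simplified sketch of such a rewrite on the header part of a cached
  entry.  The function name is hypothetical, and the Accept check is
  cruder than a real one should be (it ignores q-values, for one).

```perl
use strict;
use warnings;

# Rewrites the Content-Type of cached headers for clients whose
# Accept header mentions application/xhtml+xml.
sub adjust_content_type {
    my ($headers_text, $accept) = @_;
    return $headers_text
        unless defined $accept && $accept =~ m{application/xhtml\+xml};
    # keep parameters such as '; charset=utf-8' untouched
    $headers_text =~
        s{^(Content-Type:\s*)text/html\b}{${1}application/xhtml+xml}mi;
    return $headers_text;
}
```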

* Expires / max-age synchronized with cache lifetime,
  Last-Modified synchronized with cache entry creation time.

  Currently all cache entries have the same global (per cache instance)
  expiration time.  The Expires header is not correlated with it.

  There are two issues.  First, when storing data in cache, we can set
  the Expires header (or the max-age directive in the Cache-Control
  header) to the expiration time of the cache entry, and set
  Last-Modified to the time the cache entry was (re)generated (unless
  it is already set by gitweb).

  The other issue is that some data doesn't change, ever.  We set expire
  time to '+1d' (one day) in such case.  If we could mark those cache
  entries as having longer / infinite lifetime to not regenerate them...
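
  A small sketch of deriving the Cache-Control 'max-age' directive from
  the cache entry itself.  The function name is hypothetical, and both
  the use of a negative lifetime as a "never changes" marker and the
  one-year value are assumptions of this example.

```perl
use strict;
use warnings;

# Returns a Cache-Control value for an entry created at $created_at
# (epoch seconds) with lifetime $lifetime seconds, as seen at $now.
# A negative $lifetime is this sketch's marker for "never changes".
sub cache_control_for {
    my ($created_at, $lifetime, $now) = @_;
    return 'max-age=31536000' if $lifetime < 0;   # one year
    my $left = $created_at + $lifetime - $now;    # seconds still valid
    $left = 0 if $left < 0;
    return "max-age=$left";
}
```

  This way a browser would never keep a page longer than the cache
  entry it was served from remains valid.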

* support for If-Modified-Since (external/browser caches)

  When caching is enabled, we know when page was created.  We could
  check for If-Modified-Since conditional request header, and return
  '304 Not Modified' HTTP response if we would serve from the same
  cache entry.  It would save bandwidth, and a bit of I/O.
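
  A sketch of the check, using only core modules.  All function names
  are hypothetical, and only the fixed-length HTTP date format (the
  one in 'Thu, 01 Jan 1970 00:00:00 GMT') is parsed; anything else is
  treated as "no usable If-Modified-Since".

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Formats epoch seconds as an HTTP date (always GMT, English names).
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf('%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900, $hour, $min, $sec);
}

# Parses the format produced by http_date(); returns undef on failure.
sub parse_http_date {
    my ($str) = @_;
    my ($mday, $mon_name, $year, $hour, $min, $sec) = $str =~
        /^\w{3}, (\d{2}) (\w{3}) (\d{4}) (\d{2}):(\d{2}):(\d{2}) GMT$/
        or return;
    my ($mon) = grep { $MONTHS[$_] eq $mon_name } 0 .. $#MONTHS;
    return unless defined $mon;
    # timegm() croaks on impossible dates, hence the eval
    return eval { timegm($sec, $min, $hour, $mday, $mon, $year) };
}

# True if '304 Not Modified' can be sent instead of the cache entry
# that was created at $entry_created_at (epoch seconds).
sub not_modified {
    my ($if_modified_since, $entry_created_at) = @_;
    return 0 unless defined $if_modified_since;
    my $since = parse_http_date($if_modified_since);
    return 0 unless defined $since;
    return $entry_created_at <= $since ? 1 : 0;
}
```

  The same http_date() could produce the Last-Modified header when the
  entry is served, so that the browser has something to send back.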

* ETag support - gitweb version + cache key hash, possibly also Range
  requests.

  We can compose a strong ETag validator from the cache key hash and
  the gitweb version string.  Maybe it would make it possible to respond
  to Range requests for resuming download of e.g. a large snapshot
  file...

  But it might well be that those two features are unrelated...
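
  A sketch of composing and checking such a validator.  The function
  names, the use of MD5 and the way the version and key are joined are
  assumptions; the If-None-Match handling covers only a plain
  comma-separated list and '*'.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Builds a quoted ETag from the gitweb version and the cache key.
sub etag_for {
    my ($version, $cache_key) = @_;
    return '"' . md5_hex("$version\0$cache_key") . '"';
}

# Returns a true value if $etag is listed in the If-None-Match value.
sub etag_matches {
    my ($if_none_match, $etag) = @_;
    return 0 unless defined $if_none_match;
    (my $list = $if_none_match) =~ s/^\s+|\s+$//g;   # trim both ends
    return 1 if $list eq '*';
    return scalar grep { $_ eq $etag } split /\s*,\s*/, $list;
}
```

  Including the version string means that upgrading gitweb invalidates
  validators held by browsers, even if the cache keys stay the same.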

----------------------------------------------------------------------
Below there are proposed gitweb improvements and features, which would 
also improve caching support in gitweb:

* Time::HiRes is in core + simplify progress indicator

  Time::HiRes was first released with perl 5.007003 (5.7.3).  Because
  gitweb requires at least Perl 5.8 for its Unicode / UTF-8 support,
  we can assume that it is present.

  This would simplify code in git_generating_data_html()
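
  As an illustration of the kind of simplification meant (this is not
  the code of git_generating_data_html(); the function and its
  arguments are made up), a polling loop with sub-second steps becomes
  straightforward once Time::HiRes can be used unconditionally:

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);   # fractional-second sleep() and time()

# Polls $is_ready every $step seconds (fractions allowed) until it
# returns true or $timeout seconds have passed.  The optional $tick
# is called before each wait, e.g. to print a progress dot.
sub wait_for {
    my ($is_ready, $timeout, $step, $tick) = @_;
    my $deadline = time() + $timeout;
    until ($is_ready->()) {
        return 0 if time() >= $deadline;
        $tick->() if $tick;
        sleep($step);
    }
    return 1;
}
```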

* $per_request_config = 0/1 (default)/coderef
  or just $per_request_config = coderef

  If it would be possible to have config read only once in persistent
  environments such as mod_perl and FastCGI, and not once per request,
  it would improve performance when caching engine used has slow
  initialization / creation time, like Moose-based CHI.

  The basic idea is to put the parts of config that change per request
  (like e.g. those that gitosis or gitolite use) in a coderef in the
  $per_request_config variable; this coderef would be invoked once per
  request.  Example configuration:

    our $per_request_config = sub {
       $ENV{GL_USER} = ($cgi && $cgi->remote_user) || "gitweb";
    };
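
  A sketch of how the request loop could honour this variable in a
  persistent environment.  Everything here except $per_request_config
  is hypothetical: read_config() stands in for reading the gitweb
  config file, and the remote user is passed as an argument instead of
  being taken from $cgi, to keep the example standalone.

```perl
use strict;
use warnings;

our $per_request_config = sub {
    my ($remote_user) = @_;
    $ENV{GL_USER} = $remote_user || 'gitweb';
};

my $config_reads  = 0;
my $first_request = 1;

# Stand-in for the (possibly slow) reading of the config file.
sub read_config { $config_reads++; return }

sub run_request {
    my ($remote_user) = @_;
    my $was_first = $first_request;
    if ($first_request) {
        read_config();               # full config: once per process
        $first_request = 0;
    }
    if (ref($per_request_config) eq 'CODE') {
        $per_request_config->($remote_user);   # cheap per-request part
    } elsif ($per_request_config && !$was_first) {
        read_config();               # plain true value: re-read it all
    }
    return $ENV{GL_USER};
}
```

  With a coderef the expensive part runs once per process, while a
  plain true value keeps the current once-per-request behaviour.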

* authentication / authorization for admin stuff

  Some kind of authentication support would be needed for edit / write
  support in gitweb, and also for controlling access to the cache
  administration page.  We don't want just anyone to be able to clear
  the cache.

  In the current proof-of-concept patch the cache administration page
  is restricted to people accessing gitweb pages from localhost, or
  running gitweb as a standalone script.

* mod_perl handler

  It might be possible, by modifying gitweb only slightly, to make it
  work *also* as a native mod_perl handler, and not only via
  ModPerl::Registry.

  This would make it possible to initialize the cache once per process
  lifetime, and not once per request.

-- 
Jakub Narebski
Poland