First, what is the status of gitweb output caching? I have sent a minimal
fixup for the latest John 'Warthog9' Hawley series as "Gitweb caching
v7.1"[1]. John said that he can work on fixes on November 2 at the earliest.

[1] http://thread.gmane.org/gmane.comp.version-control.git/160147/focus=160473

While working on my rewrite of gitweb output caching, based on ideas (and
some code) from John 'Warthog9' Hawley's (J.H.) series, I have run into the
following issues:

1. Error handling

There are three separate places where we can encounter an error during
caching. They are:

  A. setting up the cache (initialization)
  B. retrieving data from the cache
  C. storing (saving) data in the cache

There are two possible choices for errors during cache initialization
(case A):

  * turn off caching
  * show "500 Internal Server Error"

The problem with turning off caching is that it hides the error, so the
admin may not notice that caching does not work. Therefore I think the
better solution would be to generate an error, especially since such errors
are usually not incidental. Note that in my opinion the cache should not be
initialized at all if caching is disabled.

The modern CHI caching interface provides a way to change how runtime
errors are handled via the 'on_get_error' and 'on_set_error' parameters to
the constructor. The default for CHI is to log the error, or to ignore it
if no logger is set. You can also provide a coderef (callback).

For errors during getting data from the cache (case B), we can either
ignore the error, which means generating the data as if the cache were
disabled / the entry did not exist in the cache, or error out and show the
error to the user. Such errors should be rare, so erroring out might be an
appropriate reaction. Note that we already have the special case of a
zero-sized cache entry file, which we treat as a non-existent entry.

For errors during setting (saving) data to the cache (case C), we can
ignore the error and not re-generate the cache (not even calling the
callback if the error is detected early), or error out.
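The policy choices for cases B and C could be sketched roughly as below. This is a core-Perl illustration only: the cache class and all names in it are hypothetical stand-ins, not gitweb code (CHI itself achieves the same effect via its 'on_get_error' / 'on_set_error' constructor parameters).

```perl
use strict;
use warnings;

package TinyCache;

# Hypothetical cache with CHI-style error callbacks, called as
# ($operation, $key, $error) when a read or write fails.
sub new {
    my ($class, %opt) = @_;
    return bless { store => {}, %opt }, $class;
}

sub get {
    my ($self, $key) = @_;
    my $val = eval { $self->_read($key) };
    if ($@) {
        # Case B policy: report the error, then behave as a cache miss,
        # i.e. the caller regenerates the data.
        $self->{on_get_error}->('get', $key, $@) if $self->{on_get_error};
        return undef;
    }
    return $val;
}

sub set {
    my ($self, $key, $val) = @_;
    eval { $self->_write($key, $val) };
    if ($@) {
        # Case C policy: report the error; the data simply is not cached.
        $self->{on_set_error}->('set', $key, $@) if $self->{on_set_error};
    }
}

sub _read  { my ($self, $key) = @_; die "read error\n" if $self->{broken}; return $self->{store}{$key}; }
sub _write { my ($self, $key, $val) = @_; die "no space left on device\n" if $self->{full}; $self->{store}{$key} = $val; }

package main;

my @errors;
my $cache = TinyCache->new(
    full         => 1,     # simulate a full filesystem (ENOSPC) on writes
    on_set_error => sub { my ($op, $key, $err) = @_; push @errors, $err },
);
$cache->set('projects_list', 'page content');   # write fails, error only recorded
print "cached entry: ", defined $cache->get('projects_list') ? 'yes' : 'no', "\n";
```

With this policy a full filesystem degrades to "caching silently off" for the affected entries, while the callback still gives the admin a place to log the condition.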
Such errors can happen for example if the filesystem gets full. It might be
better to ignore such errors, at least in the case of ENOSPC / ENOMEM.

2. Progress info and not caching error pages

J.H. discovered (or invented) a very nice hack to generate a progress info
page without Ajax (and therefore without the need for JavaScript), to make
the user aware that data is being regenerated, and that he/she should be
patient instead of refreshing or complaining. This trick uses the fact that
an internal redirect using meta refresh[2]

  <meta http-equiv="refresh" content="0" />

does not occur until the web browser gets the full page (which means that
the connection gets closed / the request ends).

[2] http://en.wikipedia.org/wiki/Meta_refresh

Besides the fact that, as J.H. wrote:

JH> There are still a few known "issues" with respect to this:
JH> - Code needs to be added to be "browser" aware so
JH>   that clients like wget that are trying to get a
JH>   binary blob don't obtain a "Generating..." page

there is an additional problem, at least in my rewrite (I didn't follow the
code flow in J.H.'s v7 patch series to check whether it is also affected).

Let's assume that the 'progress_info' feature is turned on, and that the
cache is generated in the background. Let's assume that error pages
(die_error) are not cached. Now a client tries to access a page which
would/will result in an error.

With caching, it acquires the exclusive (writers) lock, checks that there
is no stale data (error pages are never cached), checks that it is provided
with a 'generating_info' subroutine, and so forks a process to generate
data in the background. The background process detaches and tries to
generate data to cache; die_error is called. die_error turns off capturing,
and prints to STDOUT. At the end of die_error there is a jump to the outer
scope, to DONE_GITWEB; therefore $lock_fh goes out of scope, the lockfile
is closed, and the lock is released. The background process finishes.

The parent process runs the 'generating_info' subroutine.
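For reference, a 'generating_info' page based on the meta refresh trick might look roughly like the sketch below. This is only an illustration (the subroutine name and markup are mine, not gitweb's actual git_generating_data_html):

```perl
use strict;
use warnings;

# Sketch of a progress-info page using the meta refresh trick described
# above.  Because the browser acts on <meta http-equiv="refresh"> only
# after the connection is closed, the client re-requests the same URL
# once the process holding the connection open has finished.
sub generating_info_page {
    my ($title) = @_;
    return <<"EOF";
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="refresh" content="0" />
<title>$title</title>
</head>
<body>
<p>Generating page... it will reload automatically when ready.</p>
</body>
</html>
EOF
}

print generating_info_page('Generating...');
```

The problem described next is exactly that for never-cached error pages, each such reload finds the cache empty again.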
Now if it waits a bit, like in my rewrite, before employing the trick
mentioned above, and does not print anything if the lockfile is released
before the startup delay, _and_ die_error finishes within this delay, then
everything is all right: the error message is sent to the client.

If die_error doesn't finish before git_generating_data_html prints the meta
refresh, or if there is no startup delay, then error pages get infinite
redirects (remember: there is never anything in the cache for error pages).
This is a bad thing.

One possible solution to this problem (besides employing a startup delay)
is to tee the output, i.e. print it as it is being captured. The streamed
(partial) response would serve as a progress indicator for the process
(re)generating the cache; only parallel processes waiting for the cache
would show 'generating_info'. I think that in the current implementation of
capturing it would be as simple as not closing STDOUT, but I'd have to
check that.

3. Using a generic cache engine, and memory consumption

Most Perl caching interfaces support only $cache->set($key, $data), where
$data is a Perl variable, and $data = $cache->get($key), or their
equivalents. Even with file-based cache drivers you save from memory and
read into memory.

The only exception I know of is the Cache interface with the Cache::File
driver, which provides $cache->handle($key [, $mode]), where the optional
$mode argument can be any of the mode strings that can be used in the
'open' function, or one of the fopen(3) modes. (Of course for memory-based
or network-based (like memcached) caches it might not make sense to provide
such an interface.)

The most often recommended capture module, Capture::Tiny, allows capturing
only into a scalar (into memory):

  $stdout = capture \&code;

or (using its prototype)

  $stdout = capture { <code> };

Well, it is *::Tiny, so it supports a minimal API.
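Coming back to the tee idea from section 2: in core Perl (no capture module), teeing could be sketched with a tied filehandle. This is only a proof-of-concept sketch, with hypothetical names; gitweb's capture code does not work this way.

```perl
use strict;
use warnings;

package TeeStdout;

# Tied-handle class: everything printed is both appended to a buffer
# (to be stored in the cache afterwards) and streamed to the original
# STDOUT (so the client sees partial output as progress indication).
sub TIEHANDLE {
    my ($class, $buf_ref, $orig_fh) = @_;
    return bless { buf => $buf_ref, orig => $orig_fh }, $class;
}
sub PRINT {
    my $self = shift;
    ${ $self->{buf} } .= join('', @_);   # capture for the cache...
    print { $self->{orig} } @_;          # ...and stream to the client
}
sub PRINTF {
    my ($self, $fmt, @args) = @_;
    $self->PRINT(sprintf $fmt, @args);
}

package main;

my $captured = '';
open my $orig_stdout, '>&', \*STDOUT or die "cannot dup STDOUT: $!";
tie *STDOUT, 'TeeStdout', \$captured, $orig_stdout;

print "Hello from the page being generated\n";   # reaches client AND buffer

untie *STDOUT;
# $captured now holds everything that was printed, ready to be saved.
```

Note that a tie only intercepts Perl-level print; code writing to the underlying file descriptor directly would bypass it, which is why modules like Capture::Tiny redirect at the file-descriptor level instead.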
From the list of different capture modules that allow capturing of Perl
code output, with different interfaces (APIs), in the "See also" section of
the Capture::Tiny documentation[3], only IO::CaptureOutput allows capturing
into a specified file:

  capture \&code, \$stdout, undef, $outfile;

[3] http://p3rl.org/Capture::Tiny

J.H.'s gitweb output caching v7 captures output directly into cache files.
The problem with doing this in my rewrite is to allow capturing directly
into the cache entry file without losing the ability to select a different
caching engine, which might not be file-based (like e.g. a memcached-based
one).

P.S. You can always check out my latest work on the gitweb output caching
rewrite here (warning: rebased!):

  git://repo.or.cz/git/jnareb-git.git gitweb/cache-kernel-pu

(it is also available in git://github.com/jnareb/git.git)

-- 
Jakub Narębski
Poland