On Wed, 28 May 2008, Lea Wiemann wrote:
> Jakub Narebski wrote:
>>
>> Lately he posted a patch
>> implementing projects list caching, in a bit different way from how it
>> is done on kernel.org, namely by caching data and not final output:
>
> Thanks for this and all the other pointers.
>
> Caching data and not final output is actually what I'm about to try
> next.

Caching data has its advantages and disadvantages, the same as caching
HTML output (or parts of HTML output). I wrote about it in
http://thread.gmane.org/gmane.comp.version-control.git/77529

Let me summarize here the advantages and disadvantages of caching data
and of caching HTML output.

1. Caching data
 * advantages:
   - smaller than cached HTML output
   - you can use the same data to generate different pages ('summary',
     'shortlog', 'log', 'rss'/'atom'; pages or search results of the
     projects list)
   - you can generate pages with variable data, such as relative dates
     ("5 minutes ago"), staleness info ("cached data, 5 minutes old"),
     or content type: text/html vs application/xhtml+xml
 * disadvantages:
   - more CPU: data needs to be serialized and deserialized (parsed)
   - more complicated

2. Caching HTML output
 * advantages:
   - simple, no need for serialization (pay attention to that in mixed
     data + output caching solutions)
   - low CPU (although supposedly[1] gitweb performance is I/O bound,
     not CPU bound)
   - web servers deal very well with static pages
   - web servers deal with support for HTTP caching (sending ETag and
     Last-Modified headers, responding to If-Modified-Since,
     If-None-Match, etc. headers from web browsers and caching proxies)
 * disadvantages:
   - large size of cached data (if most clients support compression,
     you can store it compressed, at the cost of CPU for
     non-supporting ones)
   - variable output is difficult to impossible (for example, you can
     still rewrite some HTTP headers for text/html vs
     application/xhtml+xml or store headers separately, and you can use
     JavaScript to change visible times from absolute to relative
     dates)

I'm sure John, Lars and Petr can tell you more, and have more
experience.

[1] Some evidence both from warthog9 and pasky, but no hard data[2]

[2] I think it would be good to start with an analysis of gitweb
    statistics, e.g. from Apache logs, from kernel.org and repo.or.cz.

> If I'm not mistaken, the HTML output is significantly larger than
> the source (repository) data; however, kernel.org still seems to benefit
> from caching the HTML, rather than letting Linux' page cache cache the
> source data.

I don't think kernel.org caches _all_ pages, only the most requested
(correct me if I'm wrong here, John, please).

> That leads me to think that the page cache somehow fails
> to cache the source data properly -- I'm not sure why (wild speculation:
> perhaps because of the pack format).

From what I remember, one of the most costly pages to generate is the
projects list page (that is why Petr Baudis implemented caching for
this page in the repo.or.cz gitweb, using data caching there). With
1000+ projects (repositories) gitweb has to hit at least 1000+
packfiles, not to mention refs, to generate the "Last Changed" column
from git-for-each-ref output (incidentally, also to check whether each
is truly a git repository).

In kernel.org's case, with gitweb working similarly to the mod_userdir
module but for git repositories (as a service, rather than as part of
repo hosting), gitweb has to hit 1000+ 'summary' files... That is
interspersed with other requests. How well can the page cache and
filesystem buffers deal with that?

BTW I'm not sure if kernel.org uses CGI or "legacy" mod_perl gitweb;
currently there is no support for FastCGI in gitweb (although you can
find some patches in the archive).

(But I'm not an expert in those matters, so please take the above with
a pinch of salt, or two.)
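To make the data-caching trade-off above concrete, here is a minimal
sketch (not actual gitweb or repo.or.cz code; the sample input lines
and project name are made up) of caching parsed git-for-each-ref
output with Storable, roughly in the spirit of a projects-list cache:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Stand-in for the output of
#   git for-each-ref --format='%(refname) %(committerdate:raw)'
# A real cache would run that command once per repository.
my @lines = (
    "refs/heads/master 1211900000 +0200",
    "refs/heads/next 1211990000 +0200",
);

# Parse once: find the newest committer date, as the projects list
# needs for its "Last Changed" column.
my %project = (name => 'git.git', last_change => 0);
for my $line (@lines) {
    my ($ref, $epoch) = $line =~ /^(\S+)\s+(\d+)/ or next;
    $project{last_change} = $epoch if $epoch > $project{last_change};
}

# Serialize the parsed data (the extra CPU cost noted above), then
# deserialize it back, as a later request reading the cache would do.
my $frozen = freeze(\%project);
my $cached = thaw($frozen);
print "$cached->{name}: last change at $cached->{last_change}\n";
```

The point of the sketch is that the cached structure stays small and
can feed several different pages, but every hit pays the
freeze/thaw cost that cached static HTML would not.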
By the way, using pack files, besides reducing repository size, also
improved git performance thanks to better I/O performance and better
interaction with the filesystem cache (some say that git is optimized
for the warm-cache case).

> Anyway, I'd hope that I can
> encapsulate the 30-40 git_cmd calls in gitweb.perl and somehow cache
> their results (or, to save memory, the parts of their results that are
> actually used) and cache them using memcached. If that works well, we
> can stop bothering about frontend (HTML) caching, unless CPU becomes an
> issue, since all HTML pages are generated from cacheable source data.

I don't think caching _everything_, including rarely requested pages,
would be a good idea.

> I'm *kindof* hoping that in the end there will be only few issues with
> cache expiry, since most calls are uniquely identified through hashes.
> (And the ones that are not, like getting the hash of the most recent
> commit, can perhaps be cached with some fairly low expiry time.)

The trouble is with those requests which are _not_ uniquely identified
by the hashes requested, such as 'summary' or 'log' from a given
branch (not from a given hash), or the web feed for a given branch.
For those which are unchanging you can just (as gitweb does even now)
give a large HTTP expiry time (Expires or max-age) and allow web
browsers or proxies to cache them.

> So that's what I'll try next. If you have any comments or warnings off
> the top of your heads, feel free to send email of course. :)

I'm afraid that implementing kernel.org caching in mainline in a
generic way would be enough work for a whole GSoC 2008. I hope I am
mistaken and you will have time to analyse and implement a wider range
of caching solutions in gitweb...

>> the main culprit of [the fork] was splitting gitweb into many, many
>> files. While it helped John in understanding gitweb, it made it
>> difficult to merge changes back to mainline.
>
> Interesting point, thanks for letting me know.
> (I might have gone ahead
> and tried to split the mainline gitweb myself... ^^) I think it would
> be nice if gitweb.perl could be split at some point, but I assume there
> are too many patches out there for that to be worth the merge problems,
> right?

On one hand, gitweb.perl being a single file makes it easy to install;
on the other hand, if it were split into modules (like git-gui now is)
it would, I think, be easier to understand and modify...

I think, however, that it would be better to first make gitweb use
Git.pm, improving Git.pm when necessary (for example adding the eager
config parsing used in gitweb, i.e. reading the whole config into a
Perl hash at the first request, then accessing the hash instead of
making further git-config calls).

-- 
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html