[Please Cc: me directly, as I am not subscribed to the git mailing
list, and the GMane NNTP (news) interface I use currently doesn't show
any new posts; I wouldn't want to miss any response. Thanks in advance.]

This post presents my ideas about how to implement caching in gitweb:
my thoughts on what the problems are, and what solutions (what code)
we can use.

From what I remember of the discussions about gitweb performance and
bottlenecks on the git mailing list, the main culprits are the projects
list (which perhaps should be redesigned) and the summary pages of some
of the larger / more popular projects. Gitweb performance is I/O bound,
not CPU bound, so I guess not all existing caching solutions and ideas
would work with gitweb.

There are some difficulties with adding generic (as opposed to
site-specific) caching to gitweb. First, gitweb should work both with
mod_perl and as a CGI script (perhaps in the future also as a FastCGI
script). Second, the solution should not depend on additional packages,
at least not on ones that cannot be found packaged in extras or in
well-trusted contrib repositories; not all admins allow installing
packages from CPAN. Third, the solution could be helped by, but should
not depend on, adding helper hooks to users' repositories; while
hosting sites like repo.or.cz control the repositories, sites like
kernel.org or freedesktop.org, which give shell access, do not.

Let's talk first about what to cache.


1. Support for caching in HTTP (HTTP accelerators, caching engines)

My first idea for adding caching support to gitweb was for it to
generate proper "freshness" caching headers (Last-Modified: and ETag:)
and respond to cache validation requests (If-Modified-Since:,
If-None-Match: etc.), and to let a reverse proxy, aka. caching engine,
aka. web accelerator / HTTP accelerator (e.g. Varnish or Squid), take
care of the caching. It is better to use an existing solution, isn't it?

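To make the validation step concrete, here is a minimal sketch in Perl
of answering an If-None-Match: request. The helper name
validate_request() and the hard-coded tag are assumptions made for
illustration, not existing gitweb code; where the tag comes from is
deliberately left open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of gitweb): decide between a
# "304 Not Modified" reply and a full "200 OK" response.
#   $env  - hash ref with CGI-style environment variables (e.g. \%ENV)
#   $etag - opaque validator for the current state of the page
sub validate_request {
    my ($env, $etag) = @_;
    my $quoted = qq("$etag");    # entity tags are quoted strings

    my $inm = $env->{HTTP_IF_NONE_MATCH};
    if (defined $inm) {
        # If-None-Match: may carry a comma-separated list of tags, or "*"
        for my $candidate (split /,/, $inm) {
            $candidate =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            $candidate =~ s{^W/}{};          # weak comparison: drop W/ prefix
            return (304, $quoted)
                if $candidate eq '*' || $candidate eq $quoted;
        }
    }
    return (200, $quoted);
}

# Example: the client already holds a copy tagged "0123abcd".
my ($status, $tag) = validate_request(
    { HTTP_IF_NONE_MATCH => '"abc", "0123abcd"' }, '0123abcd');
print "Status: $status\n";
print "ETag: $tag\n";
```

The comparison itself is cheap; the open question is how to obtain the
value passed as $etag without doing most of the work of generating the
page.
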
Unfortunately, to generate the date to check against If-Modified-Since:,
or the tag to check against If-None-Match:, gitweb must do hard work;
perhaps not as much as generating the whole page in terms of CPU, but
almost the same in terms of I/O. So it is not so simple...

Nevertheless, even if using a reverse proxy for gitweb caching is not
so simple, gitweb should play well with the caching support in the HTTP
protocol, so that pages can be cached between gitweb and the user:
in one of the intermediate caches, in a proxy server with caching
support, or in the browser cache.

Currently gitweb uses the 'Expires:' header with an expiry of 24h / 1d
(IIRC the cutoff time for caches; also IIRC "forever" is half a year)
for pages which we know will not change (those using a full SHA-1
identifier and with all required information filled in). We should
probably generate Last-Modified: and/or ETag: where it is possible.

However, if gitweb has some kind of internal caching turned on, it can
respond properly to validation requests at low cost. This way some of
the requests would be handled by intermediate caches, so gitweb
wouldn't even have to access its cache to return an answer. But IMHO
that is a secondary concern: it could help, but it isn't possible to do
well without in-gitweb caching (as far as I can see).

BTW. besides optionally marking a result as being retrieved from cache
("stale" or "cached"), I think gitweb should also send an appropriate
Warning: header, see sections 13.1.2 and 14.46 of RFC 2616, e.g.

  Warning: 110 git.kernel.org "Response is stale"

References:
* "Caching Tutorial for Web Authors"
  http://www.mnot.net/cache_docs/
* HTTP 1.1 Specification (RFC 2616)
  http://www.ietf.org/rfc/rfc2616.txt


2. Caching Perl structures

One of the solutions (used for example by Petr 'Pasky' Baudis in his
last post about caching projects list info in gitweb) is to cache
(save) the Perl structures containing the information needed to
generate the response (web page). Another solution, discussed below,
would be to cache the generated output, i.e.
web page, optionally with (some) HTTP headers.

The advantage of storing Perl structures (raw data) in the cache is
that the same data can be reused for different pages (e.g. paging the
projects list if/when it gets implemented), for the same page with a
varying part (e.g. content type being text/html or
application/xhtml+xml depending on what the web browser prefers, or
transparently compressed output via Transfer-Encoding: depending on
web browser capabilities), and for replying to cache validation
requests. Additionally, we can generate web pages with correct relative
(e.g. "5 minutes ago") time info. All of these except the first and the
last are still possible when caching the [final] output, but I think it
would be much harder...

The disadvantage is that we have to decide what format to use for
serializing the data, i.e. for representing compound data as a stream
of bytes in the cache... unless of course gitweb relies on one of the
already existing caching solutions, which usually take care of this
problem for us: see the next section (in the next installment).

Formats I was considering were:
 - Data::Dumper
 - Storable (binary)
 - YAML Tiny
 - gitconfig tiny

2.1. Data::Dumper

One of the advantages of the Data::Dumper format is that it comes with
the Perl installation, so there is no problem with installing it (well,
at least it comes with the perl-5.8.6-24 RPM on Linux). Another
advantage is that it is a textual format, and thus easy to debug in
case of problems.

The main problem with Data::Dumper is that it uses eval() to thaw
(restore) data from its serialized form in the cache, which is a
serious security risk in less secure environments.

2.2. Storable

Storable also comes packaged with the Perl distribution. It offers
writing to and restoring from a file, or to/from an opened file handle.
It is fast: in my unscientific tests, around 3-4 times as fast as using
eval() to read Data::Dumper data.

One of the disadvantages is the fact that the Storable format is
binary, so you would have to write a separate Perl script to convert it
to human-readable form (e.g.
Data::Dumper form). The file begins with a format and version header,
and modern 'file' installations should detect it correctly, e.g.:

  filename: perl Storable(v0.7) data (major 2) (minor 6)

Another nuisance is the fact that the Storable(3pm) manpage states
that:

    The [retrieve()] routine returns "undef" for I/O problems or other
    internal error, a true value otherwise. Serious errors are
    propagated as a "die" exception.

But 'serious errors' include the file not being in the correct format,
so for safety (because a server-side script should return a page with
error info instead of dying silently) one would have to use
"eval { ... }" to catch errors.

Storable is used by Cache::Cache and, I think, also by other caching
solutions (packages).

2.3. YAML Tiny (subset of YAML)

YAML was created as a human-readable serialization format that is easy
to parse by machine. Unfortunately none of the YAML parsing modules
(YAML, YAML::Syck, YAML::Tiny) are packaged with Perl; on the other
hand they can be found in the Dries RPM repository, so I guess they
meet the criterion of being in extras or in a trusted contrib package
repository.

Additionally, at least YAML::Tiny (which implements a subset of YAML in
pure Perl code) is slower, around 4-5 times, than even using
Data::Dumper, and more than 10 times slower than using Storable. This
_might_ have been caused by the choice of module to implement YAML
parsing.

We could write a parser (and generator) for an even smaller subset of
YAML, to use only those features that are truly needed by gitweb... but
then we can go with the next format, which is also a text format, and
which also doesn't have the insecurities of using eval() to thaw data
(read from cache).

YAML was designed from the ground up to be an excellent syntax for
configuration files. Not necessarily so for a cache.

References:
* http://en.wikipedia.org/wiki/YAML
* YAML::Tiny(3pm)
  http://search.cpan.org/dist/YAML-Tiny/lib/YAML/Tiny.pm
* YAML Ain't Markup Language (YAML^TM) Version 1.1, Working Draft
  http://yaml.org/spec/current.html

2.4. gitconfig tiny (subset of ini-like gitconfig format)

What, I think, we would want to cache is usually a list of records, or
in Perl terminology an array of hashes; usually the ordering of the
array doesn't matter. Because of that I think it would be possible to
represent the data to be saved (cached) in the ini-like extended git
config format.

Then gitweb could either (re)use the config parser in Perl used by
git-cvsserver (which accepts a subset of the valid config format), or
use "git config --file=<path> -z -l" to slurp the data in a more
parseable format... but if we do that, we could choose this format, or
a variation of it, as our serialization format.

The cache file could look like this:

  [gitweb "<primary key value>"]
	key1 = value1
	key2 = "value with spaces"

where for the list of projects info the primary key might be the path
(relative to projectroot) to the repository.


3. Caching output: formatted pages

The alternative to caching Perl structures is caching the final output,
with or without (some/all) HTTP headers. It has the advantage that it
is simple to implement, and that the same code can be used to cache all
the pages. (But we could get a similar result by creating something
similar to Tie::Memoize, tying a hash or an array so that it
automatically gets data either from a git command or from the cache...
or we can implement a universal API, like the Cache::Cache API.)

This is, from what I understand, what the kernel.org (warthog9) gitweb
uses; I don't know what cgit (a web interface in C), which also has
some caching support, uses: does it cache data or output?

CGI::Cache (a non-standard CPAN Perl module) shows how a CGI script can
be simply extended with support for caching. On the other hand gitweb
can afford more extensive surgery.

References:
* CGI::Cache(3pm)
  http://search.cpan.org/~dcoppit/CGI-Cache-1.4200/lib/CGI/Cache.pm


To be continued...

%% .................................................................. %%

In next parts:

Cache lifetime and invalidation
 1. static cache, external refreshing e.g. by hooks
 2. stat and/or inotify, to check if repository changed
 3. cache lifetime (trying to avoid "thundering herd" problem)

CPAN packages we could use, or take inspiration from
 1. Cache::Cache (standard)
 2. CHI, Unified caching interface
 3. Cache
 4. other (e.g. Cache::Adaptive, using Cache::Cache)

-- 
Jakub Narebski
Poland