In previous part: What to cache. 1. Support for caching in HTTP (external caching) 2. Caching Perl structures (and serialization) 3. Caching gitweb output: formatted pages TO WHOM IT MAY CONCERN: John 'Warthog9' Hawley (J.H.) who created caching for gitweb at kernel.org; Petr 'Pasky' Baudis who maintains repo.or.cz fork of gitweb and lately added caching of projects list info; Lars Hjemli who is the author of cgit, git web interface in C which includes some caching. (BTW. I'd like to hear your thoughs on git web interface caching, and about solutions you have implemented). This is continuation of my thoughts about how to implement caching in gitweb, what problems we could encounter, and what existing solutions (what code/what packages) can we (re)use. One of the more important issues to think about when implementing caching is to decide when to regenerate cache, i.e. issues of cache (in)validation and lifetime. 1. Static cache, external refreshing (invalidation). The easiest situation is when cache can be invalidated (removed) externally; caching support in gitweb would then need only to either use cached information or cached output if it exists, and generate information and/or output and cache appropriate things if cache doesn't exist. In closed-up git hosting system like repo.or.cz new contents can appear in repository only via push (if repo is manually updated) or via automated fetch (if repo is mirrored automatically). This means that it is known when infomration about given repository gets stale (out of sync). It would be then enought to make 'update' or 'post-receive' hook to delete cache, or invalidate parts of cached info about given repository. Creation and deletion of repositories should also be handled by scripts; they affect caching too. This of course assumes that we can control repository hooks (perhaps git should learn hook multiplexing first, as proposed some time ago on a mailing list). This is not the case when developers are given shell access, and gitweb is offered as a part of service rather than as a part of git hosting; repositories are not under web administrator control. This is the case (according to J.H. on git mailing list) for kernel.org. So we have to examine also more generic solutions. 2. Checking filesystem (stat and/or inotify). If new objects come to repository via commit or via fetch it is enough (I guess) to watch for modifications of GIT_DIR of a project (I think due to doing atomic writes via "create temporary file, then rename it to final filename" of files in GIT_DIR: COMMIT_MSG and FETCH_HEAD). So it should be enough to check and compare stat info for GIT_DIR of a project, or of possible implement some inotify (or equivalent on other operating systems than Linux) checking, to see if cache can contain stale info. In practice what we can truly check is that nothing changed with repo. Unfortunately the above is not the case if objects come to repository via push. Note that both push resulting in crating a pack (this I think could use the same mechanism, only checking GIT_DIR/objects/pack directory), and push resulting in creation of loose objects has to be supported; additionally the refs pushed can have deeply hierarchical names. I would be grateful if somebody could think a way to check if anything could have changed for such situation... but as it is now we have to go to more complicated ways of cache invalidation. 3. Cache lifetime. Finally, for cases such as gitweb where validating cache (checking if the cached information isn't stale, out of sync with reality) is almost as costly as calculating the whole information without using cache at all, there is one possible solution to cache validation: simply keep cache for some time. For longer cache lifetimes gitweb perhaps should put some notification that information is from cache, perhaps with the time in human readable form how much time ago was this information generated (human readable means no "1325 seconds ago" info ;-). And if we want to be thorough, put it also in the HTTP header "Warning:" (at least for HTTP/1.1, see sections 13.1.2 and 14.46 of RFC 2616), e.g.: Warning: 110 git.kernel.org "Response is stale" The question is what timeout, or how to choose lifetime of a cache. J.H. kernel.org's gitweb tries to adjust cache lifetime to server load, making cache lifetime longer if server load is higher, but ensuring that cache lifetime stays within specified bounds. I have found among CPAN modules Cache::Adaptive where you can also specify bounds for expire time and subroutine to adjust cache lifetime, e.g. according to load average, process time for building the cache entry, etc. (it can use specified backend, for example Cache::FileCache from Cache::Cache distribution). Its subclass Cache::Adaptive::ByLoad which tries to adjust cache lifetime for bottlenecks under heavy load. Neither of modules I think is distributed as ready package in extras on trusted contrib packages repositories. Nevertheless we can "borrow" the algorithms used by those modules. We should also try to avoid 'thundering herd' problem, namely that cache expires, gitweb gets N requests before cache gets re-created, and [poorly designed] cache architecture makes all N do the work regenerating cache. There are several ideas of how to deal with this problem: * If (part of) cache has expired, set its expiration time to the current time plus specified duration (slop) needed to regenerate cache. It was used by original Pasky solution (and is used by further solutions for caching projects list sent here); in can be used by CHI (caching infrastructure) with busy_lock option... well, kind of. * Use some kind of locking so only one process does the work and updates the cache. From what I've briefly checked that is what kernel.org gitweb does (using flock()). The patch implementing projects list info caching does protect using O_EXCL on temporary/lock file against more than one process writing the cache, but doesn't protect against more than one process doing the work, unformtunately. * Allows items to expire a little earlier than the stated expiration time to help prevent cache miss stampedes. This is what CHI module does with expires_variance option. The probability of expiration increases as a function of how far along we are in the potential expiration window, with the probability being near 0 at the beginning of the window and approaching 1 at the end. If cache size becomes issue there will be additional complications like which entries (which cached values) to remove first when we go over the cache size limit; but lets us leave it for later, if it would be needed at all. %% In next part: CPAN packages we could use, or take inspiration from 1. Cache::Cache (standard) 2. CHI - Unified cache interface 3. Cache - the Cache interface 4. other interesting packages * Cache::Adaptive for adaptive cache lifetime solutions * Cache::Memcached and/or Cache::Swifty for caching using cache daemon * Cache::FastMmap (also example of callbacks), and caching benchmark mentioned there -- Jakub Narebski Poland -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html