[RFD] Gitweb caching, part 3: examining Perl modules for caching (long)

[Hopefully this resend won't be stopped by the vger antispam filter]

In previous parts:

What to cache.
1. Support for caching in HTTP (external caching).
2. Caching Perl structures (and serialization).
3. Caching gitweb output: formatted pages.

Cache (in)validation and lifetime
1. Static cache, external refreshing (invalidation).
2. Checking filesystem (stat and/or inotify).
3. Cache lifetime (timing-out cached info).
4. LRU (Least Recently Used) and others <- to be written

In this part I will write about existing caching solutions, or more
precisely about CPAN packages implementing caching or a cache interface
in Perl.

Note that not all sites can install packages straight from CPAN; on
some, only packages from the distribution's main repository, from an
extras repository, or sometimes from a trusted contrib repository can
be installed (see J.H.'s post about this problem).  Of the packages
mentioned here, only Cache::Cache and Cache::Mmap are available in
contrib repositories for the Aurox 11.1 distribution (based on Fedora
Core 4): the Cache::Cache distribution as perl-Cache-Cache in the Dries
RPM repository, and Cache::Mmap as perl-Cache-Mmap in both the Dries
and Dag Wieers RPM repositories.


1. Cache::Cache (standard)

It implements Cache::MemoryCache, Cache::SharedMemoryCache (using
IPC::ShareLite), Cache::FileCache and Cache::SizeAwareFileCache.  This
is the standard: various other modules often say that they implement
the Cache::Cache interface.

It shows its age a bit, so various improvements exist, including CHI,
a unified cache interface (which can use Cache::Cache modules as well
as other caching backends like Cache::FastMmap in a unified way), and
Cache, the cache interface, which tries to improve on Cache::Cache but
is not yet complete.

Here is some sample code for instantiating and using a filesystem-based
cache (Cache::FileCache uses Storable for serialization, IIRC).

  use Cache::FileCache;

  my $cache = Cache::FileCache->new({ default_expires_in => "15 minutes" });

  # try the cache first; on a miss, fall back to the real data source
  my $customer = $cache->get($name);

  if (not defined $customer) {
    $customer = get_customer_from_db($name);
    $cache->set($name, $customer, "10 minutes");
  }

  return $customer;

The Cache::Cache distribution can be found (at least on RPM-based Linux
distributions) in the perl-Cache-Cache package (e.g. in the Dries RPM
repository).

Various other modules implement the Cache::Cache interface, for example
Cache::BerkeleyDB (compare with Cache::BDB).


2. CHI - Unified cache interface

CHI provides a unified caching API.  The CHI interface is implemented
by driver classes that support fetching, storing and clearing of data.

CHI is intended as an evolution (and successor) of DeWitt Clinton's
Cache::Cache package, adhering to the basic Cache API but adding new
features and addressing limitations in the Cache::Cache implementation.

The main goals of CHI were performance (minimizing method calls,
serializing data only when necessary) and making the creation of new
drivers as easy as possible.

The latter has led to wrapping the most popular caches available on
CPAN in the CHI interface (with CHI handling serialization and
expiration times) via CHI::Driver::CacheCache, CHI::Driver::FastMmap and
CHI::Driver::Memcached.  "Native" CHI drivers include 'File' (one file
per entry), 'Memory' (per-process) and 'Multilevel' (two or more CHI
caches, e.g. memcached bolstered by a local memory cache).  'DBI' and
'BerkeleyDB' drivers are planned...

CHI provides expire_if [CODEREF] for an additional check whether a
cached item has expired, busy_lock [DURATION] to set the expiration time
to the current time plus the specified duration when a value has expired
(to prevent a "cache stampede"), and expires_variance [FLOAT] to allow
items to expire a little earlier, preventing cache miss stampedes
(favored over busy_lock); see the sketch after the basic example below.
Even if gitweb doesn't use the CHI interface directly, those ideas are
worth considering.

In addition to the standard get() and set() methods it implements a
compute() method, which combines the get and set operations in a single
call.  It also has methods to process multiple keys and/or values at
once.

Here is some sample code for instantiating and using a filesystem-based
cache.

    use CHI;

    # Choose a standard driver
    #
    my $cache = CHI->new(driver => 'File', 
                         cache_root => '/tmp/cache');

    # Basic cache operations
    #
    my $customer = $cache->get($name);
    if (!defined $customer) {
        $customer = get_customer_from_db($name);
        $cache->set($name, $customer, "10 minutes");
    }

    # or simply, combining the two operations with compute()
    $customer = $cache->compute($name, \&get_customer_from_db,
                                "10 minutes");


3. Cache - the Cache interface 

The Cache modules are a total redesign and reimplementation of
Cache::Cache and thus are not directly compatible.  Contrary to
Cache::Cache, the get() and set() methods do not serialize complex data
types; you have to freeze() and thaw() data explicitly instead of using
set/get.  You can also get an IO::Handle through which data can be read
from, or written to, the cache, e.g. when using Cache::File.  There is
no concept of 'namespace' in the basic cache interface.  Purging is done
automatically in the current implementation.

Currently only the Cache::File (a filesystem-based implementation; it
could be done more efficiently, and currently supports only LOCK_NFS
locking) and Cache::Memory (a per-process, memory-based implementation
with namespaces) drivers are implemented.

In the Cache modules one can select the removal strategy for the cache.
By default the FIFO (First In, First Out: remove the oldest entries) and
LRU (Least Recently Used: remove the stalest entries) strategies are
available; they come into play when the cache has a size limit.
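
For example, a sketch assuming the Cache::File constructor options
described in its documentation (the size limit is an arbitrary value):

  use Cache::File;

  # entries beyond the size limit are removed using the chosen strategy
  my $cache = Cache::File->new(
      cache_root       => '/tmp/cacheroot',
      size_limit       => 10 * 1024 * 1024,               # 10 MB
      removal_strategy => 'Cache::RemovalStrategy::FIFO', # default is LRU
  );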

The Cache modules provide a callback interface: load_callback, called
whenever a get() is issued for data that does not exist in the cache,
and validate_callback (for example storing and checking a timestamp, or
similar).  This means that sample code for instantiating and using a
filesystem-based cache can be written as below.

  use Cache::File;

  my $cache = Cache::File->new(cache_root => '/tmp/cacheroot');
  $cache->set_load_callback(\&get_customer_from_db);

  # calls get_customer_from_db() if needed
  my $customer = $cache->get($name);

The Cache classes can be used via the tie interface, as shown below.
This allows the cache to be accessed via a hash.  All the standard
methods for accessing the hash are supported, with the exception
of the 'keys' or 'each' call.

  tie %hash, 'Cache::File', { cache_root => $tempdir };

  $hash{'key'} = 'some data';
  $data = $hash{'key'};

The tie interface is especially useful with the load_callback to
automatically populate the hash.


Even if gitweb doesn't use the Cache modules (perhaps because of their
lack of maturity, and/or the fact that they are not in extras or trusted
contrib package repositories), the idea of a selectable removal strategy
and the idea of a callback interface are worth considering; perhaps even
the tie interface.  Whether to serialize explicitly or not... that is
also to be decided.


4. Other interesting caching packages

4.1. Cache::Adaptive for adaptive cache lifetime control

Cache::Adaptive is a cache engine with adaptive lifetime control.  Cache
lifetimes can be increased or decreased based on any factor, e.g. load
average, the time it takes to build the cache entry, etc.  It can use
almost any Cache::Cache object as a backend (the update algorithm needs
a reliable set() method, so Cache::SizeAwareFileCache cannot be used).

Cache::Adaptive::ByLoad is a subclass of Cache::Adaptive which adjusts
the cache lifetime by two factors: the load average of the platform and
the percentage of the total time spent by the builder.

Cache::Adaptive introduces an additional
  access({ key => $cache_key, builder => sub { ... } })
method, which returns the cached entry if possible, or builds the entry
by calling the builder function and optionally stores the built entry in
the cache.  Compare with the compute() method from CHI, or the callback
interfaces above.
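
Here is a sketch of the usage, roughly following the module's synopsis
(the backend, the lifetime bounds and build_page() are illustrative):

  use Cache::Adaptive;
  use Cache::FileCache;

  my $cache = Cache::Adaptive->new({
      backend     => Cache::FileCache->new({ namespace => 'html_cache' }),
      expires_min => 3,    # lower bound on entry lifetime, in seconds
      expires_max => 60,   # upper bound on entry lifetime, in seconds
  });

  print $cache->access({
      key     => $ENV{REQUEST_URI},
      builder => sub {
          # build and return the page body; how long this takes (and,
          # with Cache::Adaptive::ByLoad, the load average) drives the
          # adaptive lifetime
          build_page();
      },
  });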

It is worth examining (both the interface and the implementation)
if/when implementing cache lifetime control based on load average, as
the kernel.org gitweb tries to do.  J.H.'s (kernel.org) fork of gitweb
uses a longer lifetime under heavier load (within specified bounds).


4.2. Cache::Memcached and/or Cache::Swifty for caching using cache daemon

For larger installations, where caching is needed not only for gitweb,
it might be worth examining cache daemon solutions: memcached (and
Cache::Memcached, Cache::Memcached::Fast or the CHI equivalent), a
distributed memory cache daemon, or swifty (and Cache::Swifty), a very
fast shared memory cache which is still in its early alpha stages.

The Cache::Memcached API, besides the set/get methods and the
administrative methods, provides add() and replace() methods to set()
conditionally, depending on whether the value does not or does already
exist in the cache.
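
A short sketch of that API (the server address, namespace and keys are
examples):

  use Cache::Memcached;

  my $memd = Cache::Memcached->new({
      servers   => [ '127.0.0.1:11211' ],
      namespace => 'gitweb:',
  });

  $memd->set($key, $value, $expires_in);  # unconditional store
  $memd->add($key, $value);      # store only if the key does not exist
  $memd->replace($key, $value);  # store only if the key already exists
  my $cached = $memd->get($key);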

Memcached was created to reduce the load on high-traffic sites with a
high database load consisting mostly of reads, so it might not be
appropriate for gitweb, where I/O load, not CPU, is the main concern.
The main advantage of memcached is its ability to scale out.  Usually
you can run memcached together with a web server or database server, as
memcached is CPU-lean and memory-hungry, while web/database servers are
the reverse: CPU-hungry and usually memory-lean.  Note that gitweb (or
rather git access to repositories in gitweb) is I/O-hungry.


4.3. Cache::FastMmap (also an example of callbacks),
     and the caching benchmark mentioned there

Cache::FastMmap uses an mmap'ed file to act as a shared memory
interprocess cache.  It uses fcntl locking to ensure that multiple
processes can safely access the cache at the same time, and a basic LRU
algorithm to keep the most used entries in the cache, plus (optionally)
a cache timeout.

Cache::FastMmap was created to be very fast.

The class also supports read-through, and write-back or write-through,
callbacks to access the real data if it's not in the cache.  With those,
the code dealing with the cache can be written simply as

  my $cache = Cache::FastMmap->new(
    ...
    context  => $RealDataSourceHandle,
    read_cb  => sub { $_[0]->get($_[1]) },          # read-through
    write_cb => sub { $_[0]->set($_[1], $_[2]) },   # write-back/through
  );

  ...

  # goes through the cache; falls back to read_cb on a miss
  my $value = $cache->get($key);

  # updates the cache and, via write_cb, the real data source
  $cache->set($key, $newvalue);

It also supports a get_and_set() method to atomically retrieve and
update the value of a given key, and has methods dealing with multiple
keys at once.  There is also the Cache::FastMmap::Tie module providing a
tie interface to Cache::FastMmap.  Even if gitweb doesn't use this
module, the callback-based interface is worth considering implementing.
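
For example, a hypothetical per-key hit counter built on get_and_set()
(the callback receives the key and the current value, and returns the
new value to store):

  # atomically increment a counter stored under $key
  $cache->get_and_set($key, sub {
      my ($key, $value) = @_;
      return (defined $value ? $value : 0) + 1;
  });
  my $hits = $cache->get($key);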


4.4. CGI::Cache to help cache output of time-intensive CGI scripts
     with minimal changes to CGI script code.

This module is intended to be used in CGI scripts that may benefit from
caching; it is written in such a way that existing CGI code can get
caching added with minimal changes to the script.  Here's a simple
example:

  #!/usr/bin/perl

  use CGI;
  use CGI::Cache;

  # Set up cache
  CGI::Cache::setup();

  my $cgi = CGI->new;

  # CGI::Vars requires CGI version 2.50 or better
  CGI::Cache::set_key($cgi->Vars);

  # This should short-circuit the rest of the script if a cached value
  # is already there
  CGI::Cache::start() or exit;

  print $cgi->header, "\n";

  #...

  print <<EOF;
  This prints to STDOUT, which will be cached.
  If the next visit is within 24 hours, the cached STDOUT
  will be served instead of executing this 'print'.
  EOF

The CGI::Cache module ties the output filehandle (usually STDOUT) to an
internal variable to which all output is saved.  This technique is worth
considering if we decide on caching the final output in gitweb, or the
final output without HTTP headers.
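
Here is a minimal sketch of the underlying capture technique (not
CGI::Cache itself; generate_page(), $cache and $cache_key are
hypothetical), redirecting STDOUT into an in-memory scalar while the
page is generated:

  my $output = '';

  # save the real STDOUT and redirect print()s into $output
  open my $real_stdout, '>&', \*STDOUT or die "cannot dup STDOUT: $!";
  open STDOUT, '>', \$output or die "cannot redirect STDOUT: $!";

  generate_page();   # all output printed to STDOUT ends up in $output

  # restore STDOUT, then serve and cache the captured page
  open STDOUT, '>&', $real_stdout or die "cannot restore STDOUT: $!";
  print $output;
  $cache->set($cache_key, $output);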


5. Summary

If it is decided that gitweb should cache Perl structures, we would
certainly use Storable, which should be installed as part of the Perl
installation on most systems.  Perhaps gitweb could use the Cache::Cache
packages in general (and Cache::FileCache in particular), as they should
fulfill the "extras or trusted contrib" criterion, but I'd rather not
add another dependency to gitweb, especially since not all installations
need caching.  It could be a good solution for a gitweb fork, though,
and I guess the kernel.org gitweb could use it.

If gitweb is to implement its own solution, so as not to introduce extra
dependencies, and it is to cache Perl structures, then implementing the
Cache::Cache get/set interface, possibly improved with a callback
interface, would be a good idea.  For very large installations it would
be good to examine a memcached solution (or a multilevel cache, see CHI).
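
Purely as an illustration, such a gitweb-internal cache could look
roughly like the sketch below: Cache::Cache-like get/set on top of
Storable and a cache directory, plus a compute-style callback.  All
names, the cache root and the default lifetime are made up, and the
cached data is expected to be a reference (a Perl structure):

  package GitwebSimpleCache;

  use strict;
  use warnings;
  use Storable qw(store retrieve);
  use Digest::MD5 qw(md5_hex);
  use File::Path qw(mkpath);

  sub new {
      my ($class, %opts) = @_;
      my $self = {
          cache_root => $opts{cache_root} || '/tmp/gitweb-cache',
          expires_in => $opts{expires_in} || 15 * 60,   # in seconds
      };
      mkpath($self->{cache_root});
      return bless $self, $class;
  }

  sub _file {
      my ($self, $key) = @_;
      return "$self->{cache_root}/" . md5_hex($key);
  }

  # return the cached (Storable-deserialized) data, or undef if it is
  # missing or older than the configured lifetime
  sub get {
      my ($self, $key) = @_;
      my $file = $self->_file($key);
      return undef unless -f $file;
      return undef if time() - (stat(_))[9] > $self->{expires_in};
      return retrieve($file);
  }

  sub set {
      my ($self, $key, $data) = @_;
      store($data, $self->_file($key));
  }

  # callback interface: return cached data, or build, cache and return it
  sub compute {
      my ($self, $key, $builder) = @_;
      my $data = $self->get($key);
      if (!defined $data) {
          $data = $builder->($key);
          $self->set($key, $data);
      }
      return $data;
  }

  1;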

If gitweb is to cache its output, or its output without HTTP headers,
either using CGI::Cache or borrowing its technique would be a good idea.

Gitweb caching is meant to reduce load (mainly I/O load, according to
some mails sent to this mailing list by J.H., the kernel.org gitweb
admin, and Pasky, the repo.or.cz gitweb admin).  I think it would be
good to try out and compare, benchmarking if possible, different
solutions to the "thundering herd" aka "cache stampede" problem, and to
adaptive cache lifetime control (see Cache::Adaptive).

Thoughts? Comments?

%%
In the next part I'd like to gather thoughts and ideas on gitweb caching
from J.H. and Petr 'Pasky' Baudis...


References:
===========
[1] http://search.cpan.org
[2] http://code.google.com/p/perl-cache
[3] http://www.danga.com/memcached/
[4] http://cpan.robm.fastmail.fm/cache_perf.html

-- 
Jakub Narebski
Poland