Charles Lever <Charles.Lever@xxxxxxxxxx> wrote:

> to benchmark this i think you need to explore the architectural
> weaknesses of your approach.  how bad will it get using cachefs with
> badly designed applications or client/server setups?

There are a number of critical points:

(1) Inodes

I need to represent in the cache the files I have cached so that I can find the data attached to them and keep track of the last access times for the purposes of culling inodes to make space.

There are several ways of doing this:

(a) Set aside a portion of the disk as an inode table and search it end to end for each open attempt.

Now assume I store one inode per sector (512 bytes) and have a table of 128*1024 entries; this means the worst case is that I have to read through 64MB (16384 pages) - and the worst case is going to happen every time there's a cache miss.

Now this can be alleviated in a couple of ways:

 - By holding a bitmap, say, indicating which page-sized blocks actually have inodes in them; but that then precludes the use of do_generic_mapping_read() and readahead.

 - By simply keeping track of where the last used block is to cut short the scan, and by moving inodes down or by always allocating as low a slot as possible.

The really unpleasant side of this is that it will cycle on average at least half of this space through the page cache every time we scan the table. The average is likely to be worse than half if cache misses are taken into account. This also cuts out a chunk of the cache and makes it permanently unavailable.

(b) Store the metadata in an extensible file. This is what cachefs actually does.

It's slower than (a), but it does allow (almost) unlimited growth of the inode table and only uses as much of the space in the cache as is necessary. Since I keep a list of free inode entries, allocating a new one is very quick.

However, the worst case performance for (b) is actually worse than for (a), because I not only have to read through the blocks of inode definitions, but I also have to read the pointer blocks, and there's on average one of those per 1024 inodes. Even worse is that reading a particular inode has to be synchronous with respect to walking the indirection chain.

This also has the same unpleasant side effects on scanning the table as (a), only more so.

(c) Store the metadata records in a tree.

This would permit predictable and potentially constant time lookup for a particular inode, though we would have to walk the tree to find the particular inode we're looking for, which has to be synchronous. Scanning a tree is potentially even worse than for flat files like in (a) and (b), since you potentially have a lot more intermediate nodes to walk.

However, the really big advantage of a tree is that it's a lot easier to remove dead space, thus compressing the tree, plus it only needs to eat up as much cache space as is necessary.

Trees are a lot harder to juggle with the type of journalling that cachefs does now, however, since the worst case for fanning out any particular node is that you have to allocate as many new nodes as there are slots in the page you already have, plus one. So if you make the slots sector size, that's nine pages on a 4K page system, but 129 on a 64K page system - and you have to journal all this data...

Ideally, for optimal lookup speed, you want your tree to be balanced, and that means dealing with rotation through nodes with more than two pointers. This is something else the current method of journalling is really poor at coping with.
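To put rough numbers on the above, here's the back-of-the-envelope arithmetic behind the worst-case figures quoted for (a) and (c) - nothing here beyond the numbers already given, just spelled out:

    #include <stdio.h>

    int main(void)
    {
        unsigned long entries = 128 * 1024;  /* inodes in the table */
        unsigned long slot    = 512;         /* one inode per sector */
        unsigned long page    = 4096;

        /* (a) worst-case scan of the flat inode table: all of it */
        unsigned long bytes = entries * slot;
        printf("table scan: %luMB, %lu pages\n",
               bytes >> 20, bytes / page);          /* 64MB, 16384 pages */

        /* (c) worst-case fan-out of one tree node with sector-sized
         * slots: one new node per slot, plus one */
        printf("4K pages:  %lu nodes to allocate and journal\n",
               page / slot + 1);                    /* 9 */
        printf("64K pages: %lu nodes to allocate and journal\n",
               65536 / slot + 1);                   /* 129 */
        return 0;
    }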
However, the use of a wandering tree rather than exquisite journalling would make matters easier here; the only journalling required would be the change of tree root and the change in state of the free block lists. The downside of a wandering tree is that you have to maintain sufficient free space to be able to wander while performing a deletion to make more space.

(2) Indexing

When a file is opened, CacheFS has to look in the cache to see if that file is represented there. This means searching the set of cached files for the one in which we're particularly interested.

Now I could store the lookup keys in the inodes themselves, but this has two important consequences:

(a) It restricts the amount of auxiliary data an inode can store; this includes such things as direct data pointers.

(b) It restricts the size of a netfs key and auxiliary data that can be stored in an inode without increasing the on-disk inode representation.

Furthermore, the keys of a particular class of object from a particular netfs are generally of much the same size. For instance, AFS vnodes require a vnode ID (4 bytes) and a vnode uniquifier (4 bytes) per file - nothing else. These can be packed much more closely than inodes can, making them that much faster to search.

Therefore, I chose to arrange cachefs as a set of homogeneous indexes, where each index is defined by the netfs to hold elements of a particular size. This makes searching an index that much faster.

However, since it's an index tree that's defined on disk, open() is still synchronous in that it has to walk down the index tree until it finds (or doesn't find) the particular data file it's looking for. So the rapid opening and closing of a lot of small files is going to be penalised, though this is reduced by the fact that we cache information about the indexes we know about.

For instance, in the NFS client caching, we have two layers of indexes: a server index (pinned by NFS server records), each entry of which points to another index that keeps track of the NFS files we know about by their NFS file handles.

Unfortunately, NFS file handles are potentially quite large (128 bytes maximum for NFS4). This means that we can't fit many per page (about 30 on a 4K page). Now imagine that we're looking at a 2.6.11 kernel source tree on an NFS server; this has about 21000 files. This means that we require about 700 pages at least, more if the NFS filesystem wants to store auxiliary data as well. So the worst case lookup is reading through 700 pages (a cache miss is the worst case). If the key were stored in with the inodes, this would mean at least 2625 pages to be read (assuming 8 per page). Using do_generic_mapping_read() alleviates some of the latency associated with this by doing readahead, but it's still quite a turnover of the page cache.

If NFS had just the one index for all files on all servers, then the keys would be bigger, though maybe only by 4 bytes (IP address), permitting say 29 per page. Now assume you've got two servers, each with a kernel source tree; that brings the number of files to 42000, and a worst case lookup of say 1448 pages in the index, or 5250 in the inode table directly.

As you can see, keeping your indexes small greatly helps reduce the lookup times, provided you can keep information about the indexes pinned in memory to help you get to the bottom faster. However, having more levels of index and subdividing the key space between them brings its own pitfall: there are more levels, and walking them has to be synchronous.
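As an illustration of why the key size dominates, consider a hypothetical fixed-size index entry; this isn't the actual cachefs on-disk layout, and the 8 bytes of per-entry overhead (block pointer plus state) is just an assumption for the sake of the arithmetic:

    #include <stdio.h>

    /* Hypothetical index entry: a fixed-size netfs key plus an assumed
     * 8 bytes of overhead.  Not the real cachefs layout. */
    static unsigned int entries_per_page(unsigned int key_size)
    {
        return 4096 / (key_size + 8);
    }

    int main(void)
    {
        unsigned int files = 21000;     /* one kernel source tree */

        /* AFS key: vnode ID + uniquifier = 8 bytes */
        printf("AFS: %u entries per 4K page\n", entries_per_page(8));

        /* NFS key: up to a 128-byte NFS4 file handle */
        printf("NFS: %u entries per 4K page, ~%u pages for %u files\n",
               entries_per_page(128),
               files / entries_per_page(128), files);
        return 0;
    }

That gives roughly 256 AFS entries per page against about 30 NFS entries per page, which is where the ~700-page figure above comes from.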
Not only is walking the index levels synchronous, but creating all these indexes on disk also has to be synchronous; and, worse, it has to be journalled.

Now, all this is for the flat-file indexes that cachefs currently uses. If, instead, cachefs used a balanced tree, the worst case lookup time would be:

    round_up(log1024(#objects)) * step_time

for 4K pages (that is, 1024 pointers per page). Not only that, but the maximum page cache thrash would be round_up(log1024(#objects)) pages too. So if you've got a million objects, this would be 2 (log1024 of a million is just under 2).

(3) Data storage

Finally, once you've found your inode, you have to be able to retrieve data from it. cachefs does this by having a small number of direct pointers in the inode itself, plus a single-indirect pointer, plus a double-indirect pointer, plus potentially higher-order indirection pointers.

So, for cachefs as it stands, up to the first 112 pages (448KB with 4K pages) of a file are pointed to directly. These pages are really fast to access because the inodes are held in memory as long as the fscache cookies persist. Further into a file, performance degrades as more and more levels of indirection blocks have to be traversed, but I think that's acceptable. The single indirection block covers the next 4MB of the file and then the double indirection block covers the next 4GB. We'd then have to use triple indirection for the following 4TB, and so on.

One disadvantage of doing this is that walking the indirection chain is, of course, synchronous, though the page cache will alleviate the problem somewhat. Worse, however, is that we have to journal the allocations of pages and pointer blocks.

There are some ways of improving things:

(a) Extents

When the netfs presents some pages for caching, if these are all contiguous in the inode's page space then they could be represented on disk as a start page index, a size, and either a list of blocks or the first block of a contiguous chunk of disk.

Extents are quite tricky to deal with, depending on the degree of data integrity you want. How do you deal with overlapping extents? Do you just overwrite the first extent with the data from the second and then write an abbreviated second extent (and perhaps append as much of the second onto the first as possible)? What if you're guaranteeing not to overwrite any data? Also, you really need a flexible tree structure to manage extents, I think. On-disk contiguous extents are also impractical, as it's unlikely that you'd be able to scrape together large runs of adjacent blocks easily.

(b) Larger grained allocations

Performance could be improved, at the cost of lowering the utilisation of the disk, by allocating blocks in groups. Pointers in pointer blocks would then become a tuple containing a pointer to the block group and a bitmap of the usage of the blocks in that group - we mustn't return uninitialised data to the netfs.

This would allow one tuple to address up to 32 blocks (assuming a 32-bit bitmap): a coverage of 128KB with 4KB pages and 2MB with 64KB pages. However, the chunking factor could be anything from 2 to 32; a factor of 4 (16KB) might be a good compromise.

It would also be possible to create power-of-2 gradated allocation bands in the cache and attempt to allocate an appropriately sized chunk, depending on the relationship between the target page group's position in the inode and the size of the inode's data. This, however, would complicate the free block list handling, as each band would require its own free list maintenance.

This sort of thing would also be easy to provide mount-time tuning for.
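To make (b) a bit more concrete, a pointer-block entry might turn into something like the following. This is illustrative only, not a committed on-disk format; the structure and function names are made up for the example:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative only: one pointer-block entry addressing a group of up
     * to 32 contiguous blocks.  The bitmap says which blocks in the group
     * hold valid data, so that uninitialised data is never handed back to
     * the netfs. */
    struct cachefs_group_ptr {
        uint32_t group;     /* block number of the start of the group */
        uint32_t present;   /* bit n set => block (group + n) is valid */
    };

    /* Does the entry have valid data for the nth block in its group? */
    static bool group_block_valid(const struct cachefs_group_ptr *p,
                                  unsigned int n)
    {
        return n < 32 && (p->present & (1u << n));
    }

With 4K blocks a full 32-block group covers 128KB; a chunking factor of 4 would just cap n at 4 and need only a 4-bit map per entry.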
(4) Other objects

One thing cachefs doesn't supply facilities for is caching other types of object, such as extended attributes. This is tricky with the flat-file indexing currently used, as there's nowhere really to store a list of the objects required. A move to a tree-based filesystem would make this possible.

> for instance, what happens when the client's cache disk is much slower
> than the server (high performance RAID with high speed networking)?

Then using a local cache won't help you, no matter how hard it tries, except in the following circumstances:

(1) The server is not available.

(2) The network is heavily used by more than just one machine.

(3) The server is very busy.

> what happens when the client's cache disk fills up so the disk cache is
> constantly turning over (which files are kicked out of your backing
> cachefs to make room for new data)?

I want that to be based on an LRU approach, using last access times. Inodes pinned by being open can't be disposed of, and neither can inodes pinned by being marked so; but anything else is fair game for culling. The cache has to be scanned occasionally to build up a list of inodes that are candidates for being culled, and I think a certain amount of space must be kept available to satisfy allocation requests; therefore the culling needs thresholds (there's a rough sketch of what I have in mind further down).

Unfortunately, culling is going to be slower than allocation in general, because we always know where we're going to allocate, but we have to search for something to get the chop.

> what happens with multi-threaded I/O-bound applications when the cachefs is
> on a single spindle?

I don't think that multi-threaded applications are actually distinguishable on Linux from several single-threaded applications.

Allocation has to be serialised if proper filesystem integrity is to be maintained and if there is to be proper error handling. I do, however, try and keep the critical section as small and as fast as possible.

> is there any performance dependency on the size of the backing cachefs?

The blocks holding virtually contiguous data may be scattered further apart on the disk. This is unfortunate, but unless I can reclaim specific blocks or use larger grained allocations, there's not a lot I can do about that.

> do you also cache directory contents on disk?

That's entirely up to the netfs. AFS, yes; NFS, no. If cachefs were extended to support the caching of other types of object, this would become easier for NFS.

> remember that the application you designed this for (preserving cache
> contents across client reboots) is only one way this will be used.  some
> of us would like to use this facility to provide a high-performance
> local cache larger than the client's RAM.  :^)

Agreed. I'd like to be able to disable the journal. I'd also like to be able to use it for low priority swap space. Obviously swap space can't be evicted from the cache without approval from the swapper.

> synchronous file system metadata management is the bane of every cachefs
> implementation i know about.

Yeah. But imagine you work for a company with /usr mounted over the network by every machine. Then the power fails. When the power comes back, all these machines wake up, see that their cache is in a bad state, reinitialise it and immediately splat the server trying to regain /usr.
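Going back to the culling question above, here's the rough sketch I mentioned. It's illustrative only - the record layout, names and threshold values are made up for the example, not taken from the actual cachefs code:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: the bits of a cached-inode record the cull scan
     * would care about.  The real on-disk record holds much more. */
    struct cull_view {
        uint64_t atime;     /* last access time, for LRU ordering */
        bool     pinned;    /* open or marked pinned: never culled */
    };

    /* Hypothetical thresholds: start culling when free space drops below
     * CULL_START percent, stop once it climbs back above CULL_STOP. */
    #define CULL_START  5
    #define CULL_STOP   10

    static bool should_cull(unsigned int pct_free)
    {
        return pct_free < CULL_START;
    }

    static bool cull_can_stop(unsigned int pct_free)
    {
        return pct_free >= CULL_STOP;
    }

    /* Pick the least recently used unpinned inode, or -1 if everything is
     * pinned.  A real implementation would batch the scan into a list of
     * candidates rather than rescanning the whole table for each cull. */
    static long pick_cull_victim(const struct cull_view *tbl, size_t n)
    {
        long victim = -1;

        for (size_t i = 0; i < n; i++) {
            if (tbl[i].pinned)
                continue;
            if (victim < 0 || tbl[i].atime < tbl[victim].atime)
                victim = i;
        }
        return victim;
    }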
> have you measured what performance impact there is when cache files go from
> no indirection to single indirect blocks, or from single to double
> indirection?  have you measured how expensive it is to reuse a single cache
> file because the cachefs file system is already full?  how expensive is it
> to invalidate the data in the cache (say, if some other client changes a
> file you already have cached in your cachefs)?

Not yet. All these things require a journal update, but the changes don't actually happen immediately, and we don't actually spend that much time waiting around, except when we need to read something from the disk first. The journal manager makes sure that the altered blocks hit the disk after the journal does, and that happens in a background thread. Changes are batched up, and we can write up to about 4000 in a batch (we must not overwrite the previous BATCH and ACK marks in the journal).

> what about using an extent-based file system for the backing cachefs?
> that would probably not be too difficult because you have a good
> prediction already of how large the file will be (just look at the file
> size on the server).

Extents make deletion a lot slower, I suspect, because the structure is a lot more flexible. Extents also do not eliminate indirection; they merely move it elsewhere - an extent tree is indirect.

I'm working on making cachefs wandering-tree based at the moment, at least for the indexing. I've considered having data pointer blocks (extents) as objects in the tree, but it's actually more complicated that way :-/

> how about using smallish chunks, like the AFS cache manager, to avoid
> indirection entirely?

In what way does this help? You have to, for example, be able to cope with a 64-bit machine requesting pages from anywhere within a file, so you might get a request for page 0xFFFFFFFFFFFFFFFF from a sparse file and have to be able to deal with it. How are you going to handle that without indirection?

CacheFS at the moment can't deal with that, but it wouldn't be too hard to make it possible; it merely requires 7th-level indirection (I think - see the arithmetic below). And if I move to 64-bit block pointers at some point, you'd be able to store almost that many pages, assuming you could find a block device big enough.

> would there be any performance advantage to caching small files in memory
> and large files on disk, or vice versa?

Well, you have the page cache; but how do you decide what should be kept in memory and what should be committed to disk? If you keep small files in memory, then you lose them if the power goes off or you reboot; and if you're trying to operate disconnectedly, then you've got a problem.

David
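P.S. To put a number on the "7th-level indirection" above: with 4K blocks full of 32-bit pointers there are 1024 pointers per block, and 1024^6 = 2^60 < 2^64 <= 2^70 = 1024^7, so covering the whole 64-bit page index space does indeed take seven levels. A quick sketch of that calculation, illustrative only (it ignores the direct pointers in the inode):

    #include <stdint.h>
    #include <stdio.h>

    /* Number of 1024-way indirection levels needed to reach a given page
     * index, ignoring the direct pointers held in the inode itself. */
    static unsigned int indirection_depth(uint64_t page_index)
    {
        unsigned int depth = 0;

        do {
            depth++;
            page_index >>= 10;      /* 1024 pointers per 4K block */
        } while (page_index);

        return depth;
    }

    int main(void)
    {
        printf("%u\n", indirection_depth(0xFFFFFFFFFFFFFFFFULL)); /* 7 */
        return 0;
    }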