The attached patch adds documentation for CacheFS.

Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
---
warthog>diffstat -p1 fscache-cachefs-docs-2614mm2.diff
 Documentation/filesystems/caching/cachefs.txt |  375 ++++++++++++++++++++++++++
 1 files changed, 375 insertions(+)

diff -uNrp linux-2.6.14-mm2/Documentation/filesystems/caching/cachefs.txt linux-2.6.14-mm2-cachefs/Documentation/filesystems/caching/cachefs.txt
--- linux-2.6.14-mm2/Documentation/filesystems/caching/cachefs.txt	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.14-mm2-cachefs/Documentation/filesystems/caching/cachefs.txt	2005-11-14 16:23:38.000000000 +0000
@@ -0,0 +1,375 @@
+	===========================
+	CacheFS: Caching Filesystem
+	===========================
+
+========
+OVERVIEW
+========
+
+CacheFS is a backend for the general filesystem cache facility, FS-Cache.
+
+CacheFS uses a block device directly rather than a bunch of files under an
+already mounted filesystem. The reasons for this are discussed further on.
+If necessary, however, a file can be loopback mounted as a cache.
+
+CacheFS is based on a wandering tree approach: data already on the disk are
+(more or less) never changed in place, only replaced, which gives CacheFS
+both metadata integrity and data integrity. There is a small, simple journal
+that tracks the state of the tree and the block allocation management.
+Should the power be cut to the computer, or should it crash, all changes
+made to the cache since the last time the journal was cranked will be lost;
+but a valid tree will remain, albeit a slightly out of date one.
+
+
+========
+MOUNTING
+========
+
+Since CacheFS is actually a quasi-filesystem, it requires a block device
+behind it. The way to give it one is to mount a filesystem of type cachefs
+on a directory somewhere. The mounted filesystem will then present the user
+with a single file describing the current cache management status.
+
+There are a number of mount options that can be supplied when the cache is
+mounted:
+
+ (*) -o tag=<name>
+
+     This tells FS-Cache the name by which netfses will refer to the cache.
+     This is not strictly a necessity; if it's not given, a tag will be
+     invented based on the major and minor numbers of the block device. If
+     the netfs doesn't give FS-Cache any specific instructions, the first
+     cache in the list will be picked by default.
+
+ (*) -o wander=<n>
+
+     Set the wander timer so that CacheFS will commit the journal that long
+     after a change is made if nothing else causes the tree to wander.
+
+     n may be in the range 0 to 3600. If n is 0 then automatic wandering
+     will be disabled; otherwise it's a number of seconds. The tree is also
+     forced to wander by allocator underrun, by sync and by unmounting the
+     cache.
+
+     A smaller number means that the cache will be more up to date if the
+     power fails, but also that the allocator will cycle faster and blocks
+     will be replaced more often, lowering performance.
+
+ (*) -o autodel
+
+     Causes all files to be deleted when the last reference to them is
+     dropped. This is primarily for debugging purposes.
+
+For instance, the cache might be mounted thus:
+
+	root>mount -t cachefs /dev/hdg9 /cache-hdg9 -o tag=mycache
+	root>ls -1 /cache-hdg9
+	status
+
+However, a block device that's going to be used for a cache must be prepared
+before it can be mounted for the first time. This is done very simply, by
+writing the CacheFS magic string over the front of the device:
+
+	echo "cachefs___" >/dev/hdg9
+
+During the initial mount, the basic structure will be written into the cache
+and then the journal will be replayed as during a normal mount.
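+
+As mentioned above, a file can also be loopback mounted as a cache. The
+following is a minimal sketch of doing so; the backing file path, its size
+and the loop device number are illustrative only:
+
+	root>dd if=/dev/zero of=/var/cachefile bs=1M count=512
+	root>losetup /dev/loop0 /var/cachefile
+	root>echo "cachefs___" >/dev/loop0
+	root>mount -t cachefs /dev/loop0 /cache-loop0 -o tag=loopcache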
+
+Note that trying to mount a cache read-only will result in an error.
+
+
+=============================================
+WHY A BLOCK DEVICE? WHY NOT A BUNCH OF FILES?
+=============================================
+
+CacheFS is backed by a block device rather than by a bunch of files on a
+filesystem. This confers several advantages:
+
+ (1) Performance.
+
+     Going directly to a block device means that we can DMA directly to/from
+     the netfs's pages. If another filesystem were managing the backing
+     store, everything would have to be copied between pages. Whilst DirectIO
+     does exist, it doesn't appear easy to make use of in this situation.
+
+     New address space or file operations could be added to make it possible
+     to persuade a backing diskfs to generate block I/O directly to/from
+     disk blocks under its control, but that then means the diskfs has to
+     keep track of I/O requests to pages not under its control.
+
+     Furthermore, we only have to do one lot of readahead calculations, not
+     two; in the diskfs-backed case, the netfs would do one and the diskfs
+     would also do one.
+
+ (2) Memory.
+
+     Using a block device means that we have a lower memory usage - all data
+     pages belong to the netfs we're backing. If we used a filesystem, we
+     would have twice as many pages at certain points - one from the netfs
+     and one from the backing diskfs. In the backing diskfs model, under
+     situations of memory pressure, we'd have to allocate or keep around a
+     diskfs page to be able to write out a netfs page; or else we'd need to
+     be able to punch a hole in the backing file.
+
+     Furthermore, whilst we have to keep a certain amount of memory around
+     for every netfs inode we're backing, a backing diskfs would have to
+     keep the inode, dentry and possibly a file struct, in addition to
+     FS-specific stuff, thus adding to the burden.
+
+ (3) Holes.
+
+     The cache uses holes in files to indicate to the netfs that it hasn't
+     yet downloaded the data for that page.
+
+     Since CacheFS is its own filesystem, it can support holes in files
+     trivially. Running on top of another diskfs would limit us to using
+     ones that can support holes.
+
+     Furthermore, it would have to be made possible to detect holes in a
+     diskfs file, rather than just seeing zero-filled blocks.
+
+ (4) Integrity.
+
+     CacheFS maintains filesystem integrity through its use of a wandering
+     tree. It (for the most part) replaces blocks that need updating rather
+     than overwriting them in place. That said, certain non-structural
+     changes - such as the updating of atimes - are done in place.
+
+     CacheFS gets data integrity for free - more or less - by treating the
+     data exactly as it treats the metadata. Data blocks that need changing
+     are simply replaced. Whilst this does mean that the metadata pointing
+     to them also needs updating, quite often these changes coalesce between
+     journal updates.
+
+     Knowing that your cache is in a good state is vitally important if you,
+     say, put /usr on AFS. Some organisations put everything barring /etc,
+     /sbin, /lib and /var on AFS and have an enormous cache on every
+     computer. Imagine if the power goes out and renders every cache
+     inconsistent, requiring all the computers to re-initialise their caches
+     when the power comes back on...
+
+ (5) Disk Space.
+
+     Whilst the block device does set a hard ceiling on the amount of space
+     available, CacheFS can guarantee that all that space will be available
+     to the cache.
+     On a diskfs-backed cache, the administrator would probably want to set
+     a cache size limit, but the system wouldn't be able to guarantee that
+     all that space would be available to the cache - not unless that cache
+     was on a partition of its own.
+
+     Furthermore, with a diskfs-backed cache, if the recycler starts to
+     reclaim cache files to make space, the freed blocks may just be eaten
+     directly by userspace programs, potentially resulting in the entire
+     cache being consumed. Alternatively, netfs operations may end up being
+     held up because the cache can't get blocks on which to store the data.
+
+ (6) Users.
+
+     Users can't so easily go into CacheFS and run amok. The worst they can
+     do is cause bits of the cache to be recycled early. With a
+     diskfs-backed cache, they can do all sorts of bad things to the files
+     belonging to the cache, and they can do this quite by accident.
+
+On the other hand, there would be some advantages to using a file-based
+cache rather than a blockdev-based cache:
+
+ (1) Having to copy to a diskfs's page would mean that a netfs could just
+     make the copy and then assume its own page is ready to go.
+
+ (2) Backing onto a diskfs wouldn't require a committed block device. You
+     would just nominate a directory and go from there. With CacheFS you
+     have to repartition or install an extra drive to make use of it in an
+     existing system (though the loopback device offers a way out).
+
+ (3) You could easily make your cache bigger if the diskfs has plenty of
+     space; you could even go across multiple mountpoints. This last isn't
+     so much of a problem, though, as you can have multiple caches.
+
+
+======================
+CACHEFS ON-DISK LAYOUT
+======================
+
+The filesystem is divided into a number of parts:
+
+	  0  +---------------------------+
+	     |         Superblock        |
+	  1  +---------------------------+
+	     |          Journal          |
+	 17  +---------------------------+
+	     |                           |
+	     |           Data            |
+	     |                           |
+	END  +---------------------------+
+
+The superblock contains the filesystem ID tags and pointers to all the other
+regions. All blocks are PAGE_SIZE in size and the blocks are numbered,
+starting with the superblock as block 0. Using 32-bit block pointers, a
+maximum of 0xffffffff blocks can be accessed, meaning that the maximum cache
+size is ~16TB for 4KB pages.
+
+CacheFS will use the endianness and block size details from the kernel that
+created a cache, and it will not permit the cache to be mounted if these
+details differ from what was written on disk.
+
+The journal consists of a set of entries of sector size that keep track of
+the current root block of the tree and the various recycling and allocation
+lists.
+
+The journal is scanned on mounting to find the most recent fully committed
+tree root, and that root will be used. Any changes that were made but not
+connected to the tree rooted in the latest journal entry will be lost.
+
+The data region holds a number of things:
+
+ (1) The Metadata Tree.
+
+     As previously mentioned, CacheFS is tree based. The journal points to
+     the current committed root of the tree. The structure of this tree is
+     discussed below.
+
+ (2) Data Blocks.
+
+     These are fragments of files that are cached on behalf of network
+     filesystems.
+
+ (3) Allocation, Recycling and Reclamation Stacks and Free Blocks.
+
+     The free blocks of the filesystem are kept either in the two allocation
+     stacks, if they're laundered and ready to be used, or on the
+     reclamation stack, if they'll be ready once the journal has ticked
+     over. Note that stacks are used in order that committed stack nodes
+     don't have to be changed - we can just add another block on the front
+     and change the stack top pointer in the journal.
+
+     There are also two stacks associated with recycling trees of data
+     blocks from deleted nodes. These are processed in the background by
+     kcachefsd, and their components all get transferred to the reclamation
+     stack and thence to the allocation stacks.
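+
+To make the figures above concrete, here is a small userspace sketch of the
+arithmetic. It is purely illustrative and not taken from the CacheFS
+sources; the constants simply restate the assumptions given in this document
+(4KB blocks, 512-byte sectors, 32-bit block pointers):
+
+	#include <stdint.h>
+	#include <stdio.h>
+
+	#define BLOCK_SIZE	4096	/* PAGE_SIZE on the creating kernel */
+	#define SECTOR_SIZE	512	/* slot and journal entry granularity */
+	#define PTR_SIZE	4	/* 32-bit block pointers */
+
+	int main(void)
+	{
+		uint64_t max_blocks = 0xffffffffULL;
+		double max_tib = (double)max_blocks * BLOCK_SIZE /
+				 (1ULL << 40);
+
+		/* 8 sector-sized slots fit in one block-sized node */
+		printf("slots per node:    %d\n", BLOCK_SIZE / SECTOR_SIZE);
+
+		/* 1024 pointers per node => 10 bits of key per level */
+		printf("pointers per node: %d\n", BLOCK_SIZE / PTR_SIZE);
+
+		/* prints ~16 TiB, the "~16TB" quoted above */
+		printf("max cache size:    ~%.0f TiB\n", max_tib);
+		return 0;
+	}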
+
+
+============================
+CACHEFS METADATA NODE LAYOUT
+============================
+
+The CacheFS metadata tree has its layout based around the filesystem block
+size (PAGE_SIZE) and the sector size of the underlying device (normally 512
+bytes).
+
+Each "node" in the tree is mapped onto a single block and contains a number
+of slots of sector size, aligned on sector boundaries:
+
+	+-------+----------+
+	|       |  SLOT 0  |
+	|       +----------+
+	|       |  SLOT 1  |
+	|       +----------+
+	|       |  SLOT 2  |
+	|       +----------+
+	|       |  SLOT 3  |
+	| NODE  +----------+
+	|       |  SLOT 4  |
+	|       +----------+
+	|       |  SLOT 5  |
+	|       +----------+
+	|       |  SLOT 6  |
+	|       +----------+
+	|       |  SLOT 7  |
+	+-------+----------+
+
+Each slot can either be empty or it can hold a "leaf". There are a number of
+types of leaves:
+
+ (1) Index Object Leaf.
+
+ (2) Data File Object Leaf.
+
+ (3) Other Object Leaf.
+
+     These three all look exactly the same on disk. They have the following
+     attributes:
+
+	- The object type.
+	- A unique object ID.
+	- The parent's object ID.
+	- A netfs key and key length.
+	- A digest of the netfs key, parent object ID and netfs key length.
+	- Netfs auxiliary data.
+	- The inode maximum size.
+	- The last time this object was accessed.
+	- The netfs's name for the class of this object.
+	- A tree of data pages, the depth of that tree and the number of
+	  blocks it contains.
+
+     In general, any type of object can have data or child objects; however,
+     indexes aren't permitted to have data, and non-indexes aren't permitted
+     to have indexes as children.
+
+ (4) Pointer Leaf.
+
+     This is simply a leaf that is entirely given over to pointers to other
+     blocks (it can also contain null pointers).
+
+ (5) Shortcut Leaf.
+
+     This is a leaf that permits a chunk of keyspace to be skipped, allowing
+     the path through the tree to be shortened in some extreme cases.
+
+Note that pointer leaves can be distinguished from the other leaf types by
+the second pointer slot in the leaf: if this points into the journal, then
+it actually indicates the type of one of the other types of leaf.
+
+
+===============================
+CACHEFS METADATA TREE STRUCTURE
+===============================
+
+The CacheFS metadata tree is navigated by rigidly partitioned keyspace. For
+a 4KB page size, each step along the path of the tree consumes 10 bits of
+the key (a "subkey") - a 4KB node can hold 1024 of the 32-bit block
+pointers, hence 10 bits of fan-out per level - assuming bit 0 of byte 0 to
+be the first bit of the key:
+
+	LEVEL 0         LEVEL 1         LEVEL 2         LEVEL 3         LEVEL 4
+
+	+-------+       +-------+       +-------+       +-------+       +-------+
+	|  PTR  |------>|  PTR  |------>|  PTR  |------>|  PTR  |------>| LEAF  |
+	| LEAF  |       |       |       |       |       |  PTR  |---+   | LEAF  |
+	|  PTR  |       |       |       |       |       |       |   |   | LEAF  |
+	+-------+       +-------+       +-------+       +-------+   |   +-------+
+	                                                            |
+	                                                            |   +-------+
+	                                                            +-->| LEAF  |
+	                                                                |       |
+	                                                                |       |
+	                                                                +-------+
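+
+To make the key consumption concrete, the following sketch extracts the
+subkey for a given level. It is not taken from the CacheFS sources, and the
+exact bit ordering within the key is an assumption based on the "bit 0 of
+byte 0" convention above:
+
+	#include <stdint.h>
+
+	#define SUBKEY_BITS	10		/* 1024-way fan-out per node */
+	#define SUBKEY_MASK	((1u << SUBKEY_BITS) - 1)
+
+	/* Extract the 10-bit subkey used at the given tree level.  The key
+	 * buffer is assumed to be padded so that reading a couple of bytes
+	 * beyond the last subkey is safe.
+	 */
+	static unsigned int subkey_at_level(const uint8_t *key,
+					    unsigned int level)
+	{
+		unsigned int firstbit = level * SUBKEY_BITS;
+		unsigned int byte = firstbit / 8;
+		unsigned int shift = firstbit % 8;
+
+		/* Gather three bytes so that a subkey straddling byte
+		 * boundaries is picked up whole, then discard the leading
+		 * bits and mask down to 10 bits.
+		 */
+		uint32_t bits = (uint32_t)key[byte] |
+				((uint32_t)key[byte + 1] << 8) |
+				((uint32_t)key[byte + 2] << 16);
+
+		return (bits >> shift) & SUBKEY_MASK;
+	}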
+
+Whilst this would seem to be horribly inefficient, there are a number of
+optimisations that help to make the scheme much more efficient:
+
+ (1) No path is longer than it has to be. A node can hold more than one
+     leaf, so we don't bother fanning out a node that isn't full to
+     overflowing.
+
+	+-------+       +-------+       +-------+       +-------+       +-------+
+	|  PTR  |------>|  PTR  |------>|  PTR  |------>|  PTR  |------>| LEAF  |
+	| LEAF  |       |       |       |       |       | LEAF  |       | LEAF  |
+	|  PTR  |       |       |       |       |       |       |       | LEAF  |
+	+-------+       +-------+       +-------+       +-------+       +-------+
+
+ (2) If the path to the point at which a number of nodes can be
+     distinguished is made up of a line of nodes, each of which contains
+     just one pointer, then part of that path can be replaced by a shortcut
+     leaf pointing to a node further down. The shortcut represents several
+     adjacent nodes.
+
+	+-------+       +-------+       +-------+
+	| SHORT |------>|  PTR  |------>| LEAF  |
+	| LEAF  |       | LEAF  |       | LEAF  |
+	|  PTR  |       |       |       | LEAF  |
+	+-------+       +-------+       +-------+
+
+ (3) With a digest function that produces a reasonably even distribution
+     over the key set presented, you get, for the most part, the shortest
+     paths everywhere.
+
+ (4) With a digest function that produces a certain amount of clumping, bits
+     of the tree wind up staying in memory longer because they're referred
+     to more often. Also, with a certain amount of clumping, the tree ends
+     up being less sparse and thus occupies less disk space.
+
+ (5) It is assumed that if a node has a non-null pointer in a pointer leaf
+     at the slot indexed by the subkey for that level of the key being
+     looked up, then the leaf being sought must lie behind that pointer, if
+     it exists at all. Otherwise, it can only be in the current node, and
+     there is no need to look any further.
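+
+To tie (5) together with the node layout described earlier, here is a
+minimal in-memory sketch of that lookup rule. None of these types or names
+come from the CacheFS sources; the 16-byte digest, the way a 10-bit subkey
+is split into a slot number and a pointer index (8 slots of 128 32-bit
+pointers = 1024), and the omission of shortcut leaves are all illustrative
+assumptions. It reuses subkey_at_level() from the sketch above:
+
+	#include <stddef.h>
+	#include <stdint.h>
+	#include <string.h>
+
+	enum leaf_type { LEAF_EMPTY, LEAF_POINTER, LEAF_OBJECT };
+
+	struct node;
+
+	/* Illustrative only - a real leaf would be one 512-byte on-disk
+	 * slot, with the digest and pointer array overlaid in a union.
+	 */
+	struct leaf {
+		enum leaf_type	type;
+		uint8_t		digest[16];	/* object leaves only */
+		struct node	*ptr[128];	/* pointer leaves only */
+	};
+
+	struct node {
+		struct leaf	slot[8];	/* 8 sector-sized slots */
+	};
+
+	/* Walk down one level per subkey.  If the pointer indexed by the
+	 * subkey is non-null, rule (5) says the object - if present at all -
+	 * must lie below it; otherwise only the current node need be
+	 * searched for a matching object leaf.
+	 */
+	static struct leaf *cachefs_lookup(struct node *root,
+					   const uint8_t *digest)
+	{
+		struct node *n = root;
+		unsigned int level, i;
+
+		for (level = 0; n; level++) {
+			unsigned int sub = subkey_at_level(digest, level);
+			struct leaf *pl = &n->slot[sub / 128];
+
+			if (pl->type == LEAF_POINTER && pl->ptr[sub % 128]) {
+				n = pl->ptr[sub % 128];
+				continue;
+			}
+
+			for (i = 0; i < 8; i++)
+				if (n->slot[i].type == LEAF_OBJECT &&
+				    !memcmp(n->slot[i].digest, digest, 16))
+					return &n->slot[i];
+			return NULL;
+		}
+		return NULL;
+	}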