Charles Lever <Charles.Lever@xxxxxxxxxx> wrote:

> to benchmark this i think you need to explore the architectural
> weaknesses of your approach.  how bad will it get using cachefs with
> badly designed applications or client/server setups?

There are a number of critical points:

(1) Inodes

I need to represent in the cache the files I have cached so that I can find the data attached to them and keep track of the last access times for the purposes of culling inodes to make space.

There are several ways of doing this:

(a) Set aside a portion of the disk as an inode table and search it end to end for each open attempt.

Now assume I store one inode per sector (512 bytes) and have a table of 128*1024 entries; this means the worst case is that I have to read through 64MB (16384 pages) - and the worst case is going to happen every time there's a cache miss.

Now this can be alleviated in a couple of ways:

 - By holding a bitmap, say, indicating which page-sized blocks actually have inodes in them; but that then precludes the use of do_generic_mapping_read() and readahead.

 - By simply keeping track of where the last used block is to cut short the scan, and by moving inodes down or by always allocating as low a slot as possible.

The really unpleasant side of this is that it will cycle on average at least half of this space through the page cache every time we scan the table. The average is likely to be worse than half if cache misses are taken into account. This also cuts out a chunk of the cache and makes it permanently unavailable.

(b) Store the metadata in an extensible file. This is what cachefs actually does.

It's slower than (a), but it does allow (almost) unlimited growth of the inode table and only uses as much of the space in the cache as is necessary. Since I keep a list of free inode entries, allocating a new one is very quick.

However, the worst case performance for (b) is actually worse than for (a), because I not only have to read through the blocks of inode definitions, but I also have to read the pointer blocks, and there's on average one of those per 1024 inodes. Even worse is that reading a particular inode has to be synchronous with respect to walking the indirection chain.

This also has the same unpleasant side effects on scanning the table as (a), only more so.

(c) Store the metadata records in a tree.

This would permit predictable and potentially constant time lookup for a particular inode, though we would have to walk the tree to find the particular inode we're looking for, which has to be synchronous. Scanning a tree is potentially even worse than for flat files like in (a) and (b), since you potentially have a lot more intermediate nodes to walk.

However, the really big advantage of a tree is that it's a lot easier to remove dead space, thus compressing the tree, plus it only needs to eat up as much cache space as is necessary.

Trees are a lot harder to juggle with the type of journalling that cachefs does now, however, since the worst case for fanning out any particular node is that you have to allocate as many new nodes as there are slots in the page you already have, plus one. So if you make the slots sector size, that's nine pages on a 4K page system, but 129 on a 64K page system - and you have to journal all this data...

Ideally, for optimal lookup speed, you want your tree to be balanced, and that means dealing with rotation through nodes with more than two pointers. This is something else the current method of journalling is really poor at coping with.
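To put rough numbers on the above, here's the back-of-the-envelope arithmetic behind the worst-case figures quoted for (a) and (c) - nothing here beyond the numbers already given, just spelled out:

    #include <stdio.h>

    int main(void)
    {
        unsigned long entries = 128 * 1024;  /* inodes in the table */
        unsigned long slot    = 512;         /* one inode per sector */
        unsigned long page    = 4096;

        /* (a) worst-case scan of the flat inode table: all of it */
        unsigned long bytes = entries * slot;
        printf("table scan: %luMB, %lu pages\n",
               bytes >> 20, bytes / page);          /* 64MB, 16384 pages */

        /* (c) worst-case fan-out of one tree node with sector-sized
         * slots: one new node per slot, plus one */
        printf("4K pages:  %lu nodes to allocate and journal\n",
               page / slot + 1);                    /* 9 */
        printf("64K pages: %lu nodes to allocate and journal\n",
               65536 / slot + 1);                   /* 129 */
        return 0;
    }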
However, the use of a wandering tree rather than exquisite journalling would make matters easier here; the only journalling required would be the change of tree root and the change in state of the free block lists. The downside of a wandering tree is that you have to maintain sufficient free space to be able to wander while performing a deletion to make more space.

(2) Indexing

When a file is opened, CacheFS has to look in the cache to see if that file is represented there. This means searching the set of cached files for the one in which we're particularly interested.

Now I could store the lookup keys in the inodes themselves, but this has two important consequences:

(a) It restricts the amount of auxiliary data an inode can store; this includes such things as direct data pointers.

(b) It restricts the size of a netfs key and auxiliary data that can be stored in an inode without increasing the on-disk inode representation.

Furthermore, the keys of a particular class of object from a particular netfs are generally of much the same size. For instance, AFS vnodes require a vnode ID (4 bytes) and a vnode uniquifier (4 bytes) per file - nothing else. These can be packed much more closely than inodes can, making them that much faster to search.

Therefore, I chose to arrange cachefs as a set of homogeneous indexes, where each index is defined by the netfs to hold elements of a particular size. This makes searching an index that much faster.

However, since it's an index tree that's defined on disk, open() is still synchronous in that it has to walk down the index tree until it finds (or doesn't find) the particular data file it's looking for. So the rapid opening and closing of a lot of small files is going to be penalised, though this is reduced by the fact that we cache information about the indexes we know about.

For instance, in the NFS client caching, we have two layers of indexes: a server index (pinned by NFS server records), each entry of which points to another index that keeps track of the NFS files we know about by their NFS file handles.

Unfortunately, NFS file handles are potentially quite large (128 bytes maximum for NFS4). This means that we can't fit many per page (about 30 on a 4K page). Now imagine that we're looking at a 2.6.11 kernel source tree on an NFS server; this has about 21000 files. This means that we require about 700 pages at least, more if the NFS filesystem wants to store auxiliary data as well. So the worst case lookup is reading through 700 pages (a cache miss is the worst case). If the key were stored in with the inodes, this would mean at least 2625 pages to be read (assuming 8 per page). Using do_generic_mapping_read() alleviates some of the latency associated with this by doing readahead, but it's still quite a turnover of the page cache.

If NFS had just the one index for all files on all servers, then the keys would be bigger, though maybe only by 4 bytes (IP address), permitting say 29 per page. Now assume you've got two servers, each with a kernel source tree; that brings the number of files to 42000, and a worst case lookup of say 1448 pages in the index, or 5250 in the inode table directly.

As you can see, keeping your indexes small greatly helps reduce the lookup times, provided you can keep information about the indexes pinned in memory to help you get to the bottom faster. However, having more levels of index and subdividing the key space between them brings its own pitfall: there are more levels, and walking them has to be synchronous.
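As an illustration of why the key size dominates, consider a hypothetical fixed-size index entry; this isn't the actual cachefs on-disk layout, and the 8 bytes of per-entry overhead (block pointer plus state) is just an assumption for the sake of the arithmetic:

    #include <stdio.h>

    /* Hypothetical index entry: a fixed-size netfs key plus an assumed
     * 8 bytes of overhead.  Not the real cachefs layout. */
    static unsigned int entries_per_page(unsigned int key_size)
    {
        return 4096 / (key_size + 8);
    }

    int main(void)
    {
        unsigned int files = 21000;     /* one kernel source tree */

        /* AFS key: vnode ID + uniquifier = 8 bytes */
        printf("AFS: %u entries per 4K page\n", entries_per_page(8));

        /* NFS key: up to a 128-byte NFS4 file handle */
        printf("NFS: %u entries per 4K page, ~%u pages for %u files\n",
               entries_per_page(128),
               files / entries_per_page(128), files);
        return 0;
    }

That gives roughly 256 AFS entries per page against about 30 NFS entries per page, which is where the ~700-page figure above comes from.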
Not only is walking the index levels synchronous, but creating all these indexes on disk also has to be synchronous; and, worse, it has to be journalled.

Now, all this is for the flat-file indexes that cachefs currently uses. If, instead, cachefs used a balanced tree, the worst case lookup time would be:

    round_up(log1024(#objects)) * step_time

for 4K pages (that is, 1024 pointers per page). Not only that, but the maximum page cache thrash would be round_up(log1024(#objects)) pages too. So if you've got a million objects, this would be 2 (log1024 of a million is just under 2).

(3) Data storage

Finally, once you've found your inode, you have to be able to retrieve data from it. cachefs does this by having a small number of direct pointers in the inode itself, plus a single-indirect pointer, plus a double-indirect pointer, plus potentially higher-order indirection pointers.

So, for cachefs as it stands, up to the first 112 pages (448KB with 4K pages) of a file are pointed to directly. These pages are really fast to access because the inodes are held in memory as long as the fscache cookies persist. Further into a file, performance degrades as more and more levels of indirection blocks have to be traversed, but I think that's acceptable. The single indirection block covers the next 4MB of the file and then the double indirection block covers the next 4GB. We'd then have to use triple indirection for the following 4TB, and so on.

One disadvantage of doing this is that walking the indirection chain is, of course, synchronous, though the page cache will alleviate the problem somewhat. Worse, however, is that we have to journal the allocations of pages and pointer blocks.

There are some ways of improving things:

(a) Extents

When the netfs presents some pages for caching, if these are all contiguous in the inode's page space then they could be represented on disk as a start page index, a size, and either a list of blocks or the first block of a contiguous chunk of disk.

Extents are quite tricky to deal with, depending on the degree of data integrity you want. How do you deal with overlapping extents? Do you just overwrite the first extent with the data from the second and then write an abbreviated second extent (and perhaps append as much of the second onto the first as possible)? What if you're guaranteeing not to overwrite any data? Also, you really need a flexible tree structure to manage extents, I think. On-disk contiguous extents are also impractical, as it's unlikely that you'd be able to scrape together large runs of adjacent blocks easily.

(b) Larger grained allocations

Performance could be improved, at the cost of lowering the utilisation of the disk, by allocating blocks in groups. Pointers in pointer blocks would then become a tuple containing a pointer to the block group and a bitmap of the usage of the blocks in that group - we mustn't return uninitialised data to the netfs.

This would allow one tuple to address up to 32 blocks (assuming a 32-bit bitmap): a coverage of 128KB with 4KB pages and 2MB with 64KB pages. However, the chunking factor could be anything from 2 to 32; a factor of 4 (16KB) might be a good compromise.

It would also be possible to create power-of-2 gradated allocation bands in the cache and attempt to allocate an appropriately sized chunk, depending on the relationship between the target page group's position in the inode and the size of the inode's data. This, however, would complicate the free block list handling, as each band would require its own free list maintenance.

This sort of thing would also be easy to provide mount-time tuning for.
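To make (b) a bit more concrete, a pointer-block entry might turn into something like the following. This is illustrative only, not a committed on-disk format; the structure and function names are made up for the example:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative only: one pointer-block entry addressing a group of up
     * to 32 contiguous blocks.  The bitmap says which blocks in the group
     * hold valid data, so that uninitialised data is never handed back to
     * the netfs. */
    struct cachefs_group_ptr {
        uint32_t group;     /* block number of the start of the group */
        uint32_t present;   /* bit n set => block (group + n) is valid */
    };

    /* Does the entry have valid data for the nth block in its group? */
    static bool group_block_valid(const struct cachefs_group_ptr *p,
                                  unsigned int n)
    {
        return n < 32 && (p->present & (1u << n));
    }

With 4K blocks a full 32-block group covers 128KB; a chunking factor of 4 would just cap n at 4 and need only a 4-bit map per entry.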
(4) Other objects

One thing cachefs doesn't supply facilities for is caching other types of object, such as extended attributes. This is tricky with the flat-file indexing currently used, as there's nowhere really to store a list of the objects required. A move to a tree-based filesystem would make this possible.

> for instance, what happens when the client's cache disk is much slower
> than the server (high performance RAID with high speed networking)?

Then using a local cache won't help you, no matter how hard it tries, except in the following circumstances:

(1) The server is not available.

(2) The network is heavily used by more than just one machine.

(3) The server is very busy.

> what happens when the client's cache disk fills up so the disk cache is
> constantly turning over (which files are kicked out of your backing
> cachefs to make room for new data)?

I want that to be based on an LRU approach, using last access times. Inodes pinned by being open can't be disposed of, and neither can inodes pinned by being marked so; but anything else is fair game for culling. The cache has to be scanned occasionally to build up a list of inodes that are candidates for being culled, and I think a certain amount of space must be kept available to satisfy allocation requests; therefore the culling needs thresholds (there's a rough sketch of what I have in mind further down).

Unfortunately, culling is going to be slower than allocation in general, because we always know where we're going to allocate, but we have to search for something to get the chop.

> what happens with multi-threaded I/O-bound applications when the cachefs is
> on a single spindle?

I don't think that multi-threaded applications are actually distinguishable on Linux from several single-threaded applications.

Allocation has to be serialised if proper filesystem integrity is to be maintained and if there is to be proper error handling. I do, however, try and keep the critical section as small and as fast as possible.

> is there any performance dependency on the size of the backing cachefs?

The blocks holding virtually contiguous data may be scattered further apart on the disk. This is unfortunate, but unless I can reclaim specific blocks or use larger grained allocations, there's not a lot I can do about that.

> do you also cache directory contents on disk?

That's entirely up to the netfs. AFS, yes; NFS, no. If cachefs were extended to support the caching of other types of object, this would become easier for NFS.

> remember that the application you designed this for (preserving cache
> contents across client reboots) is only one way this will be used.  some
> of us would like to use this facility to provide a high-performance
> local cache larger than the client's RAM.  :^)

Agreed. I'd like to be able to disable the journal. I'd also like to be able to use it for low priority swap space. Obviously swap space can't be evicted from the cache without approval from the swapper.

> synchronous file system metadata management is the bane of every cachefs
> implementation i know about.

Yeah. But imagine you work for a company with /usr mounted over the network by every machine. Then the power fails. When the power comes back, all these machines wake up, see that their cache is in a bad state, reinitialise it and immediately splat the server trying to regain /usr.
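Going back to the culling question above, here's the rough sketch I mentioned. It's illustrative only - the record layout, names and threshold values are made up for the example, not taken from the actual cachefs code:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: the bits of a cached-inode record the cull scan
     * would care about.  The real on-disk record holds much more. */
    struct cull_view {
        uint64_t atime;     /* last access time, for LRU ordering */
        bool     pinned;    /* open or marked pinned: never culled */
    };

    /* Hypothetical thresholds: start culling when free space drops below
     * CULL_START percent, stop once it climbs back above CULL_STOP. */
    #define CULL_START  5
    #define CULL_STOP   10

    static bool should_cull(unsigned int pct_free)
    {
        return pct_free < CULL_START;
    }

    static bool cull_can_stop(unsigned int pct_free)
    {
        return pct_free >= CULL_STOP;
    }

    /* Pick the least recently used unpinned inode, or -1 if everything is
     * pinned.  A real implementation would batch the scan into a list of
     * candidates rather than rescanning the whole table for each cull. */
    static long pick_cull_victim(const struct cull_view *tbl, size_t n)
    {
        long victim = -1;

        for (size_t i = 0; i < n; i++) {
            if (tbl[i].pinned)
                continue;
            if (victim < 0 || tbl[i].atime < tbl[victim].atime)
                victim = i;
        }
        return victim;
    }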
> have you measured what performance impact there is when cache files go from
> no indirection to single indirect blocks, or from single to double
> indirection?  have you measured how expensive it is to reuse a single cache
> file because the cachefs file system is already full?  how expensive is it
> to invalidate the data in the cache (say, if some other client changes a
> file you already have cached in your cachefs)?

Not yet. All these things require a journal update, but the changes don't actually happen immediately, and we don't actually spend that much time waiting around, except when we need to read something from the disk first. The journal manager makes sure that the altered blocks hit the disk after the journal does, and that happens in a background thread. Changes are batched up, and we can write up to about 4000 in a batch (we must not overwrite the previous BATCH and ACK marks in the journal).

> what about using an extent-based file system for the backing cachefs?
> that would probably not be too difficult because you have a good
> prediction already of how large the file will be (just look at the file
> size on the server).

Extents make deletion a lot slower, I suspect, because the structure is a lot more flexible. Extents also do not eliminate indirection; they merely move it elsewhere - an extent tree is indirect.

I'm working on making cachefs wandering-tree based at the moment, at least for the indexing. I've considered having data pointer blocks (extents) as objects in the tree, but it's actually more complicated that way :-/

> how about using smallish chunks, like the AFS cache manager, to avoid
> indirection entirely?

In what way does this help? You have to, for example, be able to cope with a 64-bit machine requesting pages from anywhere within a file, so you might get a request for page 0xFFFFFFFFFFFFFFFF from a sparse file and have to be able to deal with it. How are you going to handle that without indirection?

CacheFS at the moment can't deal with that, but it wouldn't be too hard to make it possible; it merely requires 7th-level indirection (I think - see the arithmetic below). And if I move to 64-bit block pointers at some point, you'd be able to store almost that many pages, assuming you could find a block device big enough.

> would there be any performance advantage to caching small files in memory
> and large files on disk, or vice versa?

Well, you have the page cache; but how do you decide what should be kept in memory and what should be committed to disk? If you keep small files in memory, then you lose them if the power goes off or you reboot; and if you're trying to operate disconnectedly, then you've got a problem.

David
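P.S. To put a number on the "7th-level indirection" above: with 4K blocks full of 32-bit pointers there are 1024 pointers per block, and 1024^6 = 2^60 < 2^64 <= 2^70 = 1024^7, so covering the whole 64-bit page index space does indeed take seven levels. A quick sketch of that calculation, illustrative only (it ignores the direct pointers in the inode):

    #include <stdint.h>
    #include <stdio.h>

    /* Number of 1024-way indirection levels needed to reach a given page
     * index, ignoring the direct pointers held in the inode itself. */
    static unsigned int indirection_depth(uint64_t page_index)
    {
        unsigned int depth = 0;

        do {
            depth++;
            page_index >>= 10;      /* 1024 pointers per 4K block */
        } while (page_index);

        return depth;
    }

    int main(void)
    {
        printf("%u\n", indirection_depth(0xFFFFFFFFFFFFFFFFULL)); /* 7 */
        return 0;
    }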