Lever, Charles <Charles.Lever@xxxxxxxxxx> wrote:

> 1. attributes cached on the disk are either up to date, or clearly out
>    of date (otherwise there's no way to tell whether cached data is stale
>    or not), and

We have to trust the netfs to know when an inode is obsolete. That's why
cachefs calls back to the netfs to validate the inodes it finds. With AFS
this checks the vnode version number and the data version number.

> in fact, you don't need to maintain data coherency up to the very last
> moment, since the client is pushing data to the server for permanent
> storage. cached data in the local backing FS can be out of date after a
> client reboot without any harm whatever, so it doesn't matter a wit that
> the on-disk state of the backing FS trails the page cache.

True, but also consider the fact that if a netfs wants to throw a page into
the cache, it must keep that page around long enough for us to write it to
disk. So if the user is grabbing a file, say, twice the size of the maximum
pagecache size, being too lazy will hold up the read as the VM then tries
to eject pages that are pending being written to the cache.

Actually, the best way to do this would be to get the VM involved in the
caching, I think. Currently, the netfs has to issue a write to the disk,
and there're only certain points at which it's able to do that:

 - readpage completion
 - page release
 - writepage (if the page is altered locally)

The one that's going to impact performance least is when the netfs finishes
reading a page.

Getting the VM involved would allow the VM to batch up writes to the cache
and to predict better when to do the writes. One of the problems I've got
is that I'd like to be able to gang up writes to the cache, but that's
difficult as the pages tend to be read individually across the network, and
thus complete individually.

Furthermore, consider the fact that the netfs records state tracking
information in the cache (such as AFS's data version). This must be
modified after the changed pages are written to the cache (or deleted from
it), lest you end up with old data for the version specified.

> (of course you do need to sync up completely with the server if you
> intend to use CacheFS for disconnected operation, but that can be
> handled by "umount" rather than keeping strict data coherency all the
> time).

Disconnected operation is a whole 'nother kettle of miscellaneous swimming
things.

One of the reasons I'd like to move to a wandering tree is that it makes
data journalling almost trivial; and if the tree is arranged correctly, it
makes it possible to get a free inode update too - thus allowing the netfs
coherency data to be updated simultaneously.

> it also doesn't matter if the backing FS can't keep up with the server.
> the failure mode can be graceful, so that as the backing FS becomes
> loaded, it passes more requests back to the server and caches less data
> and fewer requests. this is how it works when there is more data to
> cache than there is space to cache it; it should work the same way if
> the I/O rate is higher than the backing FS can handle.

True. I've defined the interface to return -ENOBUFS if we can't cache a
file right now, or just to silently drop the thing and tell the netfs we
did it. The read operation would then return -ENODATA next time, thus
indicating that we need to fetch it from the server again.
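To make the intended semantics of that concrete, here's a rough sketch of
how a netfs read path might consume such an interface. The function names
and the cookie type are invented for the illustration - they aren't the
actual cachefs API, just a picture of the error handling described above:

/*
 * Illustrative sketch only: cachefs_read_page(), cachefs_write_page(),
 * netfs_fetch_from_server() and struct cache_cookie are invented names
 * standing in for whatever the real hooks end up being called.
 */
#include <linux/errno.h>

struct page;
struct cache_cookie;

extern int cachefs_read_page(struct cache_cookie *cookie, struct page *page);
extern int cachefs_write_page(struct cache_cookie *cookie, struct page *page);
extern int netfs_fetch_from_server(struct page *page);

static int netfs_readpage_via_cache(struct cache_cookie *cookie,
				    struct page *page)
{
	int ret;

	ret = cachefs_read_page(cookie, page);
	switch (ret) {
	case 0:
		/* Cache hit: the page was filled from the backing cache */
		return 0;

	case -ENODATA:
		/* The cache has no copy (perhaps it silently dropped an
		 * earlier store), so fetch from the server and offer the
		 * page to the cache again; a refusal (-ENOBUFS) on the
		 * store is simply ignored */
		ret = netfs_fetch_from_server(page);
		if (ret == 0)
			cachefs_write_page(cookie, page);
		return ret;

	case -ENOBUFS:
		/* The cache can't service us right now; behave as if
		 * there were no cache at all */
		return netfs_fetch_from_server(page);

	default:
		return ret;
	}
}

The point being that the netfs always has the server to fall back on, so
the cache is free to refuse or quietly discard things whenever it's under
pressure.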
The best way to do it is probably to have hysteresis on allocation for
insertion: suspend insertion if the number of free blocks falls below a
lower limit, and re-enable insertion if the number of free blocks rises
above a higher limit. Then set the culler running if we drop below the
higher limit. And then, if insertion is suspended, we start returning
-ENOBUFS on requests to cache something.

Not only that, but if a netfs wants to update a block, we can also return
-ENOBUFS and steal the block that held the old data (with a wandering tree
that's fairly easy to do). The stolen block can then be laundered and made
available to the allocator again.

> > Actually, probably the biggest bottleneck is the disk block allocator.
>
> in my experience with the AFS cache manager, this is exactly the
> problem. the ideal case is where the backing FS behaves a lot like swap
> -- just get the bits down onto disk in any location, without any
> sophisticated management of free space. the problem is keeping track of
> the data blocks during a client crash/reboot.

Which is something swap space doesn't need to worry about: it's
reinitialised on boot, so filesystem integrity is not an issue. If we don't
care about integrity, life is easy.

The main problem in the allocator is one of tentative allocation versus the
journal update that moves the free list pointer. If I can hold off on the
latter, or just discard the former and send the tentative block for
relaundering, then I can probably reduce the serialisation problems.

> the real problem arises when the cache is full and you want to cache a
> new file. the cache manager must choose a file to reclaim, release all
> the blocks for that file, then immediately reallocate them for the new
> file. all of this is synchronous activity.

Not exactly. I plan to have cachefs anticipate the need by keeping a float
of free blocks. Whilst this reduces the utilisation of the cache, it should
decrease the allocator latency.

> are there advantages to a log-structured file system for this purpose?

Yes, but there are a lot more problems, and the problems increase with
cache size:

 (1) You need to know what's where in the cache; that means scanning the
     cache on mount. You could offset this by storing your map in the block
     device and only scanning on power failure (when the blockdev wasn't
     properly unmounted).

 (2) You need to keep a map in memory. I suppose you could keep the map on
     disk and rebuild it on umount and mount after power failure, but this
     does mean scanning the entire block device.

 (3) When the current point in the cache catches up with the tail, what do
     you do? Do you just discard the block at the tail? Or do you move it
     down if it's not something you want to delete yet? (And how do you
     decide which?)

     This will potentially have the effect of discarding regularly used
     items from the cache at regular intervals; particularly if someone
     uses a data set larger than the size of the cache.

 (4) How do you decide where the current point is? This depends on whether
     you're willing to allow pages to overlap "page boundaries" or not.

You could take a leaf out of JFFS2's book and divide the cache into erase
blocks, each of several pages. This would then cut down on the amount of
scanning you need to do, and would make handling small files trivial.

If you can get this right it would be quite cute, but it would make
handling of pinned files awkward: you can't throw away anything that's
pinned, but must slide it down instead.
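To make the pinning problem concrete, here's a very rough sketch of what
the tail-compaction step ends up looking like once pinned data has to be
copied forward rather than discarded. All the structures and helpers are
invented for the example - this is just an illustration of the three cases,
not anything that exists in cachefs:

/*
 * Illustrative sketch: tail compaction for a hypothetical log-structured
 * cache split into JFFS2-style erase blocks.
 */
struct cache;

struct erase_block {
	unsigned int	nr_pages;	/* pages held in this erase block */
	unsigned int	nr_pinned;	/* how many of them are pinned */
};

extern int page_is_pinned(struct erase_block *blk, unsigned int n);
extern void copy_page_to_head(struct cache *cache, struct erase_block *blk,
			      unsigned int n);
extern void recycle_block(struct cache *cache, struct erase_block *blk);
extern void skip_block(struct cache *cache, struct erase_block *blk);

static void compact_tail_block(struct cache *cache, struct erase_block *blk)
{
	unsigned int n;

	if (blk->nr_pinned == 0) {
		/* Nothing pinned: discard the whole block and reuse it at
		 * the head of the log */
		recycle_block(cache, blk);
		return;
	}

	if (blk->nr_pinned == blk->nr_pages) {
		/* Fully pinned: don't pay to copy data that can't be
		 * discarded anyway - just step over the block */
		skip_block(cache, blk);
		return;
	}

	/* Partially pinned: the pinned pages must be slid down to the head
	 * of the log before the block can be reused - the expensive case */
	for (n = 0; n < blk->nr_pages; n++)
		if (page_is_pinned(blk, n))
			copy_page_to_head(cache, blk, n);

	recycle_block(cache, blk);
}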
Now imagine half your cache is pinned - you're potentially going to end up
spending a lot of time shovelling stuff down, unless you can skip blocks
that are fully pinned.

> is there a good way to trade disk space for the performance of your
> block allocator?

Potentially. If I can keep a list of tentative allocations and add that to
the journal, then it's easy to zap them during replay. It does, however,
complicate journalling.

> in fact, with an infinitely fast server and network, there would be no
> need for local caching at all. so maybe that's not such an interesting
> thing to consider.

Not really. The only thing it guards against is the server becoming
unavailable.

> it might be more appropriate to design, configure, and measure CacheFS
> with real typical network and server latency numbers in mind.

Yes. What I'm currently using as the basis of my design is accessing a
kernel source tree over the network. That's on the order of 22000 files
these days and 320MB of disk space; that's an average occupancy of about
14.5KB of space per file.

As an alternative load, I consider what it would take to cache /usr. That's
about 373000 files on my box and 11GB of disk space; that's about 29KB per
file.

> david, what is the behavior when the file that needs to be cached is
> larger than the backing file system? for example, what happens when
> some client application starts reading a large media file that won't fit
> entirely in the cache?

Well, it depends. If it's a large sparse file that we're only going to grab
a few blocks from, then it's not a problem; but if it's something we're
going to read all of, then obviously we've got a problem.

I think I have to set a limit on the maximum number of blocks a file can
occupy in the cache. Beyond that, if a file occupies all the blocks of
cache it can, then we have to refuse allocation of more blocks for it.

What happens then depends. If the file is officially pinned by the user, we
can't actually get rid of it, and all we can do is give them rude error
messages. If it's merely pinned by virtue of being held by an fscache
cookie, then we could just keep it around until the cookie is released, or
we could just start silently recycling the space it's currently occupying
whilst returning -ENOBUFS to all further attempts to cache more of that
file.

But unless the user gives us a hint, we can't judge in advance what the
best action is. I'd consider O_DIRECT as being a hint, of course, but we
may want to make other options available.

David