Amir Goldstein <amir73il@xxxxxxxxx> wrote:

> > (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > file for every whole operation (which may combine reads and writes).
>
> I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> to keep track of present data, it's hard to distinguish between an
> invalid cached range and a valid "cached hole".

I wasn't exactly intending to permit caching over NFS.  That leads to fun
making sure that the superblock you're caching isn't the one that has the
cache in it.

However, we will need to handle hole-punching being done on a cached netfs,
even if that's just to completely invalidate the cache for that file.

> With ->fiemap() you can at least make the distinction between a non existing
> and an UNWRITTEN extent.

I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
assertion is that the cache can't rely on the backing filesystem's metadata
because those filesystems can arbitrarily insert or remove blocks of zeros to
bridge or split extents.

> You didn't say much about crash consistency or durability requirements of the
> cache. Since cachefiles only syncs the cache on shutdown, I guess you
> rely on the hosting filesystem to provide the required ordering guarantees.

There's an xattr on each file in the cache to record the state.  I use this
to mark a cache file "open".  If, when I look up a file, the file is marked
open, it is just discarded at the moment.

Now, there are two types of data stored in the cache: data that has to be
stored as a single complete blob and is replaced as such (e.g. symlinks and
AFS dirs), and data that might be randomly modified (e.g. regular files).

For the former, I have code, though in yet another branch, that writes this
into a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut
over.  For the latter, that's harder to do as it would require copying the
data to the tmpfile before we're allowed to modify it.  However, if it's
possible to create a tmpfile that's a CoW version of a data file, I could go
down that route.

But after I've written and sync'd the data, I set the xattr to mark the file
as no longer open.  At the moment I'm doing this too lazily, only doing it
when a netfs file gets evicted or when the cache gets withdrawn, but I really
need to add a queue of objects to be sealed as they're closed.  The balance
is working out how often to do the sealing, as something like a shell script
can do a lot of consecutive open/write/close ops.

> How does this work with write through network fs cache if the client system
> crashes but the write gets to the server?

The presumption is that the coherency info on the server will change, but
won't get updated in the cache.

> Client system get restart with older cached data because disk caches were
> not flushed before crash. Correct? Is that case handled? Are the caches
> invalidated on unclean shutdown?

The netfs provides some coherency info for the cache to store.  For AFS, for
example, this is the data version number (though it should probably include
the volume creation time too).  This is stored with the state info in the
same xattr and is only updated when the "open" state is cleared.  When the
cache file is reopened, if the coherency info doesn't match what we're
expecting (presumably we queried the server), the file is discarded.  (Note
that the coherency info is netfs-specific.)
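To put that in slightly more concrete terms, the reopen check amounts to
something like the following.  The struct layout, names and xattr handling
here are purely illustrative - a sketch of the idea, not the actual on-disk
format:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One xattr per cache file carries both the "open" flag and the
 * netfs-supplied coherency blob (e.g. the AFS data version).
 */
struct cache_obj_state {
	uint8_t  open;			/* Non-zero: file was left "open" */
	uint8_t  coherency_len;		/* Length of netfs coherency data */
	uint8_t  coherency[];		/* Netfs-specific, opaque to the cache */
};

/* Decide whether a cache file can be reused when it is looked up again:
 * discard it if it was never sealed, or if the netfs's current coherency
 * data (typically just obtained from the server) no longer matches what
 * was stored when the file was last sealed.
 */
static bool cache_obj_is_reusable(const struct cache_obj_state *state,
				  const void *coherency, uint8_t len)
{
	if (state->open)
		return false;	/* Crash or unclean shutdown: discard */
	if (state->coherency_len != len ||
	    memcmp(state->coherency, coherency, len) != 0)
		return false;	/* Server-side info changed: discard */
	return true;
}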
> Anyway, how are those ordering requirements going to be handled when entire
> indexing is in a file? You'd practically need to re-implement a filesystem

Yes, the thought has occurred to me too.  I would be implementing a "simple"
filesystem - and we have lots of those :-/.  The most obvious solution is to
use the backing filesystem's metadata - except that that's not possible.

> journal or only write cache updates to a temp file that can be discarded at
> any time?

It might involve keeping a bitmap of "open" blocks.  Those blocks get
invalidated when the cache restarts.  The simplest solution would be to wipe
the entire cache in such a situation, but that goes against one of the
important features I want out of it.

Actually, a journal of open and closed blocks might be better, though all I
really need to store for each block is a 32-bit number (there's a rough
sketch of the sort of record I mean at the bottom of this mail).  It's a
particular problem if I'm doing DIO to the data storage area but buffering
the changes to the metadata.  Further, the metadata and data might be on
different media, just to add to the complexity.

Another possibility is only to cull blocks when the parent file is culled.
That probably makes more sense: as long as the file is registered as culled
on disk first and I don't reuse the file slot too quickly, I can write to the
data store before updating the metadata.

> If you come up with a useful generic implementation of a "file data
> overlay", overlayfs could also use it for "partial copy up" as well as for
> implementation of address space operations, so please keep that in mind.

I'm trying to implement things so that the netfs does look-aside when
reading, and multi-destination write-back when writing - but the netfs is in
the driving seat and the cache is invisible to the user.  I really want to
avoid overlaying the cache on top of the netfs such that the cache becomes
the primary access point.

David
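(To flesh out the journal idea mentioned above: each record would be little
more than a 32-bit block number plus an open/close flag.  The names below
are entirely made up - this is a sketch of the shape of the thing, not code
from any branch.)

#include <stdint.h>
#include <stddef.h>

#define BITS_PER_LONG	(8 * sizeof(unsigned long))

enum cache_block_op {
	CACHE_BLOCK_OPEN  = 1,	/* Block is about to be modified */
	CACHE_BLOCK_CLOSE = 2,	/* Block data has been written and synced */
};

struct cache_journal_rec {
	uint32_t block;		/* 32-bit block index within the cache file */
	uint8_t  op;		/* CACHE_BLOCK_OPEN or CACHE_BLOCK_CLOSE */
	uint8_t  pad[3];
};

/* Replay the journal on cache restart, accumulating the set of blocks that
 * were opened but never closed.  Any block still marked open afterwards was
 * potentially half-written and must be invalidated rather than served back
 * to the netfs.
 */
static void cache_journal_replay(const struct cache_journal_rec *recs,
				 size_t n, unsigned long *open_bitmap)
{
	for (size_t i = 0; i < n; i++) {
		unsigned long bit = 1UL << (recs[i].block % BITS_PER_LONG);
		size_t word = recs[i].block / BITS_PER_LONG;

		if (recs[i].op == CACHE_BLOCK_OPEN)
			open_bitmap[word] |= bit;
		else
			open_bitmap[word] &= ~bit;
	}
}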