On Thu, Jan 25, 2024 at 02:02:27PM +0000, David Howells wrote:
> Here's a roadmap for the future development of netfslib and local caching
> (e.g. cachefiles).
>
> Netfslib
> ========
>
> [>] Current state:
>
> The netfslib write helpers have gone upstream now and are in v6.8-rc1, with
> both the 9p and afs filesystems using them.  This provides larger I/O size
> support to 9p and write-streaming and DIO support to afs.
>
> The helpers provide their own version of generic_perform_write() that:
>
>  (1) doesn't use ->write_begin() and ->write_end() at all, completely
>      taking over all of the buffered I/O operations, including writeback.
>
>  (2) can perform write-through caching, setting up one or more write
>      operations and adding folios to them as we copy data into the
>      pagecache, then starting them as we finish.  This is then used for
>      O_SYNC and O_DSYNC and can be used with immediate-write caching modes
>      in, say, cifs.
>
> Filesystems using this then deal with iov_iters and ideally would not deal
> with pages or folios at all - except incidentally where a wrapper is
> necessary.
>
>
> [>] Aims for the next merge window:
>
> Convert cifs to use netfslib.  This is now in Steve French's for-next
> branch.
>
> Implement content crypto and bounce buffering.  I have patches to do this,
> but it would only be used by ceph (see below).
>
> Make libceph and rbd use iov_iters rather than referring to pages and
> folios as much as possible.  This is mostly done and rbd works - but
> there's one bit in rbd that still needs doing.
>
> Convert ceph to use netfslib.  This is about half done, but there are some
> wibbly bits in the ceph RPCs that I'm not sure I fully grasp.  I'm not sure
> I'll quite manage this and it might get bumped.
>
> Finally, change netfslib so that it uses ->writepages() to write data to
> the cache, even data on clean pages just read from the server.  I have a
> patch to do this, but I need to move cifs and ceph over first.  This means
> that netfslib, 9p, afs, cifs and ceph will no longer use PG_private_2 (aka
> PG_fscache) and Willy can have it back - he just then has to wrest control
> from NFS and btrfs.
>
>
> [>] Aims for future merge windows:
>
> Using a larger chunk size than PAGE_SIZE - for instance 256KiB - but that
> might require fiddling with the VM readahead code to avoid read/read races.
>
> Cache AFS directories - these are just files and are currently downloaded
> and parsed locally for readdir and lookup.
>
> Cache directories from other filesystems.
>
> Cache inode metadata, xattrs.

Implications for permission checking might get interesting depending on how
that's supposed to work for filesystems such as cephfs that support idmapped
mounts.  But I need to understand more details to say something less
handwavy.

> Add support for fallocate().
>
> Implement content crypto in other filesystems, such as cifs, which has its
> own non-fscrypt way of doing this.
>
> Support for data transport compression.
>
> Disconnected operation.
>
> NFS.  NFS at the very least needs to be altered to give up the use of
> PG_private_2.
>
>
> Local Caching
> =============
>
> There are a number of things I want to look at with local caching:
>
> [>] Although cachefiles has switched from using bmap to using SEEK_HOLE
> and SEEK_DATA, this isn't sufficient, as we cannot rely on the backing
> filesystem not optimising its extents, which can introduce both false
> positives and false negatives.  Cachefiles needs to track the
> presence/absence of data for itself.
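
(As an aside, the probing in question boils down to something like the
snippet below.  This is a userspace illustration, not the actual cachefiles
code - the kernel side does the equivalent internally - but it shows the
assumption that breaks down: whatever the backing filesystem reports as
"data" gets taken as "present in the cache".)

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Walk the data extents of a cache backing file as SEEK_DATA/SEEK_HOLE
 * report them.  A backing filesystem that merges, preallocates or otherwise
 * optimises extents will report something other than "exactly the ranges
 * the cache wrote".
 */
static void dump_extents(int fd, off_t size)
{
	off_t pos = 0;

	while (pos < size) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t hole;

		if (data < 0)
			break;		/* ENXIO: no more data */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			hole = size;
		printf("data: %lld..%lld\n", (long long)data, (long long)hole);
		pos = hole;
	}
}

int main(int argc, char *argv[])
{
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	dump_extents(fd, lseek(fd, 0, SEEK_END));
	close(fd);
	return 0;
}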

> I had a partially-implemented solution that stores a block bitmap in an
> xattr, but that only worked up to files of 1G in size (with bits
> representing 256K blocks in a 512-byte bitmap).
>
> [>] An alternative cache format might prove more fruitful.  Various AFS
> implementations use a 'tagged cache' format with an index file and a bunch
> of small files, each of which contains a single block (typically 256K in
> OpenAFS).
>
> This would offer some advantages over the current approach:
>
>  - it can handle entry reuse within the index
>  - doesn't require an external culling process
>  - doesn't need to truncate/reallocate when invalidating
>
> There are some downsides, including:
>
>  - each block is in a separate file
>  - metadata coherency is more tricky - a powercut may require a cache wipe
>  - the index key is highly variable in size if used for multiple
>    filesystems
>
> But OpenAFS has been using this for something like 30 years, so it's
> probably worth a try.
>
> [>] Need to work out some way to store xattrs, directory entries and inode
> metadata efficiently.
>
> [>] Using NVRAM as the cache rather than spinning rust.
>
> [>] Support for disconnected operation to pin desirable data and keep
> track of changes.
>
> [>] A user API by which the cache for specific files or volumes can be
> flushed.
>
>
> Disconnected Operation
> ======================
>
> I'm working towards providing support for disconnected operation, so that,
> provided you've got your working set pinned in the cache, you can continue
> to work on your network-provided files when the network goes away and
> resync the changes later.

As long as it doesn't involve upcalls... :)
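
On the bitmap-in-xattr approach above: the 1G ceiling falls straight out of
the sizes - 512 bytes of bitmap is 4096 bits, and 4096 x 256KiB granules is
exactly 1GiB.  A minimal sketch of that mapping (made-up names, purely for
illustration, not the partially-implemented code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

#define CACHE_GRANULE_SHIFT	18			/* 256KiB granules */
#define CACHE_GRANULE_SIZE	(1ULL << CACHE_GRANULE_SHIFT)
#define CACHE_BITMAP_BYTES	512			/* xattr payload */
#define CACHE_MAX_FILE_SIZE	\
	(CACHE_BITMAP_BYTES * 8 * CACHE_GRANULE_SIZE)	/* == 1GiB */

struct cache_granule_map {
	uint8_t bits[CACHE_BITMAP_BYTES];	/* one bit per 256KiB granule */
};

/* Mark the granule containing @pos as present in the cache. */
static inline void cache_map_set(struct cache_granule_map *map, off_t pos)
{
	size_t granule = pos >> CACHE_GRANULE_SHIFT;

	map->bits[granule / 8] |= 1U << (granule % 8);
}

/* Query whether the granule containing @pos is present in the cache. */
static inline bool cache_map_test(const struct cache_granule_map *map,
				  off_t pos)
{
	size_t granule = pos >> CACHE_GRANULE_SHIFT;

	return map->bits[granule / 8] & (1U << (granule % 8));
}

Growing past 1G means a bigger xattr (where the backing filesystem allows
it), coarser granules, or moving the map out of an xattr altogether -
presumably part of why the tagged-cache format looks attractive.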