Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:

> > I've been working on getting NFS converted to dhowells new fscache and
> > netfs APIs and running into a problem with how NFS is designed and it
> > involves the NFS pagelist.c / pgio API.  I'd appreciate it if you could
> > review and give your thoughts on possible approaches.  I've tried to
> > outline some of the possibilities below.  I tried coding option #3 and
> > ran into some problems, and it has a serialization limitation.  At this
> > point I'm leaning towards option 2, so I'll probably try that approach
> > if you don't have time for review or have strong thoughts on it.
>
> I am not going through another redesign of the NFS code in order to
> accommodate another cachefs design.  If netfs needs a refactoring or
> redesign of the I/O code then it will be immediately NACKed.
>
> Why does netfs need to know these details about the NFS code anyway?

There are some issues we have to deal with in fscache - and some
opportunities.

 (1) The way cachefiles reads data from the cache is very hacky (calling
     readpage on the backing filesystem and then installing an interceptor
     on the waitqueue for the PG_locked page flag on that page, then
     memcpying the page in a worker thread) - but it was the only way to do
     it at the time.

     Unfortunately, it's fragile and it seems just occasionally the wake
     event is missed.

     Since then, kiocb has come along.  I really want to switch to using
     this to read/write the cache.  It's a lot more robust and also allows
     async DIO to be performed, also cutting out the memcpy.

     Changing the fscache IO part of the API would make this easier.

 (2) The way cachefiles finds out whether data is present (using bmap) is
     not viable on ext4 or xfs and has to be changed.  This means I have to
     keep track of the presence of data myself, separately from the backing
     filesystem's own metadata.

     To reduce the amount of metadata I need to keep track of and store, I
     want to increase the cache granularity - that is, I will only store,
     say, blocks of 256K.  But that needs to feed back up to the filesystem
     so that it can ask the VM to expand the readahead.

 (3) VM changes are coming that affect the filesystem address space
     operations.  THP is already here, though not rolled out into all
     filesystems yet.  Folios are (probably) on their way.  These manage
     with page aggregation.  There's a new readahead interface function too.

     This means, however, that you might get an aggregate page that is
     partially cached.  In addition, the current fscache IO API cannot deal
     with these.

     I think only 9p, afs, ceph, cifs, nfs plus orangefs don't support THPs
     yet.  The first five Willy has held off on because fscache is a
     complication and there's an opportunity to make a single solution that
     fits all five.

     Also to this end, I'm trying to make it so that fscache doesn't retain
     any pointers back into the network filesystem structures, beyond the
     info provided to perform a cache op - and that is only required on a
     transient basis.

 (4) I'd like to be able to encrypt the data stored in the local cache, and
     Jeff Layton is adding support for fscrypt to ceph.  It would be nice if
     we could share the solution with all of the aforementioned bunch of
     five filesystems by putting it into the common library.

So with the above, there is an opportunity to abstract the handling of the
VM I/O ops for network filesystems - 9p, afs, ceph, cifs and nfs - into a
common library that handles the VM I/O ops and translates them to RPC calls,
cache reads and cache writes.
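
To give a rough idea of the shape I'm thinking of for that library - and to
be clear, all the names below are invented for illustration, not an actual
API - the network filesystem would hand the library a small ops table and
the library would drive the address_space ops, deciding for each chunk of a
request whether it can be read from the cache or needs the filesystem to
issue an RPC, and writing freshly fetched data back into the cache:

/* Purely illustrative - none of these names exist.  The filesystem fills
 * in a small ops table; the library does the rest.
 */
struct netfs_lib_read_request;
struct netfs_lib_subrequest;

struct netfs_lib_ops {
	/* Let the fs/cache expand the proposed read (e.g. out to rsize or
	 * cache granule boundaries) before it gets sliced up. */
	void (*expand_readahead)(struct netfs_lib_read_request *rreq);

	/* Clamp one slice of the request, e.g. to the server's rsize. */
	bool (*clamp_length)(struct netfs_lib_subrequest *subreq);

	/* Fire off an RPC to fill one slice from the server. */
	void (*issue_read)(struct netfs_lib_subrequest *subreq);
};

The kiocb-based cache reads and writes from (1) would then live inside the
library too, so none of the five filesystems would have to care how the
cache does its I/O.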
The thought is that we should be able to push the aggregation of pages into
RPC calls there, handle rsize/wsize and allow requests to be sliced up so
that they can be distributed to multiple servers (works for ceph), so that
all five filesystems can get the same benefits in one go.

Btw, I'm also looking at changing the way indexing works, though that should
only very minorly alter the nfs code and doesn't require any restructuring.
I've simplified things a lot and I'm hoping to remove a couple of thousand
lines from fscache and cachefiles.

David
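
P.S. To illustrate the slicing very roughly (again, entirely made-up names,
just to show the shape): the library would carve a read up into subrequests
no larger than rsize (or the cache granule) and point each one at whichever
source - the cache or one of the servers - is appropriate:

struct netfs_lib_read_request {
	unsigned long long start;	/* file offset of the request */
	unsigned long long len;		/* total bytes to read */
};

/* Pick a source for one slice and kick off an async read of it. */
void netfs_lib_issue_slice(struct netfs_lib_read_request *rreq,
			   unsigned long long start,
			   unsigned long long len);

static void netfs_lib_slice_request(struct netfs_lib_read_request *rreq,
				    unsigned long long rsize)
{
	unsigned long long start = rreq->start;
	unsigned long long remaining = rreq->len;

	while (remaining) {
		unsigned long long part =
			remaining < rsize ? remaining : rsize;

		netfs_lib_issue_slice(rreq, start, part);

		start += part;
		remaining -= part;
	}
}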