On Thu, 2020-07-30 at 12:51 +0100, David Howells wrote:
> Hi Linus, Trond/Anna, Steve, Eric,
>
> I have an fscache rewrite that I'm tempted to put in for the next merge
> window:
>
>     https://lore.kernel.org/linux-fsdevel/159465784033.1376674.18106463693989811037.stgit@xxxxxxxxxxxxxxxxxxxxxx/
>
> It improves the code by:
>
>  (*) Ripping out the stuff that uses page cache snooping and kernel_write()
>      and using kiocb instead.  This gives multiple wins: it uses async DIO
>      rather than snooping for updated pages and then copying them, and it
>      has less VM overhead.
>
>  (*) Object management is also simplified, getting rid of the state machine
>      that was managing things and using a much simplified thread pool
>      instead.
>
>  (*) Object invalidation creates a tmpfile and diverts new activity to that
>      so that it doesn't have to synchronise with in-flight ADIO.
>
>  (*) Using a bitmap stored in an xattr rather than using bmap to find out if
>      a block is present in the cache.  Probing the backing filesystem's
>      metadata to find out is not reliable in modern extent-based filesystems
>      as they may insert or remove blocks of zeros.  Even SEEK_HOLE/SEEK_DATA
>      are problematic since they don't distinguish transparently inserted
>      bridging.
>
> I've provided a read helper that handles ->readpage, ->readpages, and
> preparatory writes in ->write_begin.  Willy is looking at using this as a
> way to roll his new ->readahead op out into filesystems.  A good chunk of
> this will move into MM code.
>
> The code is simpler, and this is nice too:
>
>     67 files changed, 5947 insertions(+), 8294 deletions(-)
>
> not including documentation changes, which I still need to convert to rst
> format.  That removes a whole bunch more lines.
>
> But there are reasons you might not want to take it yet:
>
>  (1) It starts off by disabling fscache support in all the filesystems that
>      use it: afs, nfs, cifs, ceph and 9p.  I've taken care of afs, and Dave
>      Wysochanski has patches for nfs:
>
>          https://lore.kernel.org/linux-nfs/1596031949-26793-1-git-send-email-dwysocha@xxxxxxxxxx/
>
>      but they haven't been reviewed by Trond or Anna yet, and Jeff Layton
>      has patches for ceph:
>
>          https://marc.info/?l=ceph-devel&m=159541538914631&w=2
>
>      I've briefly discussed cifs with Steve, but nothing has started there
>      yet.  9p I've not looked at yet.
>
>      Now, if we're okay with going a kernel release with 4/5 filesystems
>      having caching disabled and then pushing the changes for the individual
>      filesystems through their respective trees, it might be easier.
>
>      Unfortunately, I wasn't able to get together with Trond and Anna at LSF
>      to discuss this.
>
>  (2) The patched afs fs passed xfstests -g quick (unlike the upstream code,
>      which oopses pretty quickly with caching enabled).  Dave and Jeff's nfs
>      and ceph code is getting close, but not quite there yet.

That was my experience on cephfs+fscache too -- it often crashed down in the
fscache code.

I'd support the approach in (1) above -- put this in soon and disable the
caches in the filesystems, then push the changes to re-enable it via the
fs-specific trees.

The ceph patch series is more or less ready.  It passes the whole xfstests
"quick" group (aside from a few expected failures on cephfs).  The only real
exception is generic/531, which seems to trigger OOM kills in my testing.
That test deliberately creates a ton of files and leaks their file
descriptors.  I tend to think that workload is pretty unusual, and given that
fscache was terribly unstable and crashed before, this is still a marked
improvement.
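To make the bitmap-in-xattr point from (*) above a bit more concrete: this is
only a minimal userspace sketch of the idea, not the actual cachefiles code.
The xattr name, the 512-byte map limit and the 256KB granule size are all
assumptions made for illustration.

/*
 * Sketch of a per-block "presence" bitmap kept in an xattr on the backing
 * file: one bit per 256KB granule, rather than probing the backing fs's
 * extent metadata with bmap or SEEK_HOLE/SEEK_DATA.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

#define CACHE_XATTR	"user.cache.content_map"	/* hypothetical name */
#define GRANULE_SHIFT	18				/* 256KB granules */

/* Return 1 if the granule covering 'offset' is marked present, 0 if not,
 * -1 on error.  The xattr is just a little-endian bitmap, one bit per
 * granule of the file. */
static int granule_present(const char *path, long long offset)
{
	unsigned char map[512];
	ssize_t len = getxattr(path, CACHE_XATTR, map, sizeof(map));
	size_t granule = offset >> GRANULE_SHIFT;

	if (len < 0)
		return -1;
	if (granule / 8 >= (size_t)len)
		return 0;		/* beyond the recorded map: not cached */
	return (map[granule / 8] >> (granule % 8)) & 1;
}

/* Mark the granule covering 'offset' as present and write the map back. */
static int mark_granule(const char *path, long long offset)
{
	unsigned char map[512] = {0};
	ssize_t len = getxattr(path, CACHE_XATTR, map, sizeof(map));
	size_t granule = offset >> GRANULE_SHIFT;
	size_t need = granule / 8 + 1;

	if (len < 0)
		len = 0;		/* no map yet; start from empty */
	if (need > sizeof(map))
		return -1;
	map[granule / 8] |= 1 << (granule % 8);
	return setxattr(path, CACHE_XATTR, map,
			need > (size_t)len ? need : (size_t)len, 0);
}

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <file> <offset>\n", argv[0]);
		return 1;
	}
	mark_granule(argv[1], atoll(argv[2]));
	printf("present: %d\n", granule_present(argv[1], atoll(argv[2])));
	return 0;
}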
>  (3) Al has objections to the ITER_MAPPING iov_iter type that I added:
>
>          https://lore.kernel.org/linux-fsdevel/20200719014436.GG2786714@xxxxxxxxxxxxxxxxxx/
>
>      but note that iov_iter_for_each_range() is not actually used by
>      anything.
>
>      However, Willy likes it and would prefer to make it ITER_XARRAY
>      instead, as he might be able to use it in other places, though there's
>      an issue where I'm calling find_get_pages_contig(), which takes a
>      mapping (though all it does is then get the xarray out of it).
>
>      Instead I would have to use ITER_BVEC, which has quite a high overhead,
>      though it would mean that the RCU read lock wouldn't be necessary.
>      This would require 1K of memory for every 256K block the cache wants
>      to read; for any read over 1M, I'd have to use vmalloc() instead.
>
>      I'd also prefer not to use ITER_BVEC because the per-element offset
>      and length are superfluous here.  If ITER_MAPPING is not good, would
>      it be possible to have an ITER_PAGEARRAY that just takes a page array
>      instead?  Or, even, create a transient xarray?
>
>  (4) The way object culling is managed needs overhauling too, but that's a
>      whole 'nother patchset.  We could wait till that's done too, but its
>      lack doesn't prevent what we have now from being used.
>
> Thoughts?
>
> David

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
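For a sense of where the "1K of memory for every 256K block" figure in (3)
comes from: a 256KB block is 64 pages of 4KB, and struct bio_vec is 16 bytes
on 64-bit, so the bvec array alone costs 64 * 16 = 1024 bytes per block; a
read over ~1MB needs more than 256 bvecs, i.e. more than a page of array,
hence the kvmalloc()/vmalloc() fallback.  What follows is only a rough,
untested kernel-style sketch of what an ITER_BVEC fallback might look like;
the function name and structure are illustrative guesses, not taken from the
actual patches.

#include <linux/uio.h>
#include <linux/pagemap.h>
#include <linux/mm.h>

#define CACHE_BLOCK_PAGES 64	/* one 256KB block = 64 x 4KB pages */

/* Build a bvec-backed iov_iter over the pages of one cache block.  On
 * success the caller owns a page ref on each page and must kvfree() the
 * bvec array after the I/O completes. */
static int cache_setup_bvec_iter(struct address_space *mapping, pgoff_t index,
				 struct iov_iter *iter, struct bio_vec **bvp)
{
	struct page *pages[CACHE_BLOCK_PAGES];
	struct bio_vec *bv;
	unsigned int i, got;

	/* 64 bvecs * 16 bytes = 1KB of allocation per 256KB block */
	bv = kvmalloc_array(CACHE_BLOCK_PAGES, sizeof(*bv), GFP_KERNEL);
	if (!bv)
		return -ENOMEM;

	got = find_get_pages_contig(mapping, index, CACHE_BLOCK_PAGES, pages);
	for (i = 0; i < got; i++) {
		/* Whole pages only, so the offset/length fields carry no
		 * information - which is why they feel superfluous here. */
		bv[i].bv_page   = pages[i];
		bv[i].bv_offset = 0;
		bv[i].bv_len    = PAGE_SIZE;
	}

	iov_iter_bvec(iter, READ, bv, got, (size_t)got * PAGE_SIZE);
	*bvp = bv;
	return 0;
}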