Roadmap for netfslib and local caching (cachefiles)

David Howells <dhowells@xxxxxxxxxx> · Thu, 25 Jan 2024 14:02:27 +0000

Here's a roadmap for the future development of netfslib and local caching
(e.g. cachefiles).

Netfslib
========

[>] Current state:

The netfslib write helpers have gone upstream now and are in v6.8-rc1, with
both the 9p and afs filesystems using them.  This provides larger I/O size
support to 9p and write-streaming and DIO support to afs.

The helpers provide their own version of generic_perform_write() that:

 (1) doesn't use ->write_begin() and ->write_end() at all, completely taking
     over all of the buffered I/O operations, including writeback.

 (2) can perform write-through caching, setting up one or more write
     operations and adding folios to them as we copy data into the pagecache
     and then starting them as we finish.  This is then used for O_SYNC and
     O_DSYNC and can be used with immediate-write caching modes in, say, cifs.

Filesystems using this then deal with iov_iters and ideally would not deal
with pages or folios at all - except incidentally where a wrapper is necessary.

[>] Aims for the next merge window:

Convert cifs to use netfslib.  This is now in Steve French's for-next branch.

Implement content crypto and bounce buffering.  I have patches to do this, but
it would only be used by ceph (see below).

Make libceph and rbd use iov_iters rather than referring to pages and folios
as much as possible.  This is mostly done and rbd works - but there's one bit
in rbd that still needs doing.

Convert ceph to use netfslib.  This is about half done, but there are some
wibbly bits in the ceph RPCs that I'm not sure I fully grasp.  I'm not sure
I'll quite manage this and it might get bumped.

Finally, change netfslib so that it uses ->writepages() to write data to the
cache, even data on clean pages just read from the server.  I have a patch to
do this, but I need to move cifs and ceph over first.  This means that
netfslib, 9p, afs, cifs and ceph will no longer use PG_private_2 (aka
PG_fscache) and Willy can have it back - he just then has to wrest control
from NFS and btrfs.

[>] Aims for future merge windows:

Using a larger chunk size than PAGE_SIZE - for instance 256KiB - but that
might require fiddling with the VM readahead code to avoid read/read races.
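
To illustrate what a larger chunk size means for a read, here is a minimal
sketch that expands a byte range outwards to whole 256KiB chunks.  The
names CHUNK_SIZE and round_to_chunks() are hypothetical and the fixed
256KiB value is just the example figure from above:

```c
#include <stdint.h>

#define CHUNK_SIZE (256u * 1024u)	/* assumed 256KiB cache chunk */

/* Expand [*start, *start + *len) outwards to whole-chunk boundaries. */
void round_to_chunks(uint64_t *start, uint64_t *len)
{
	uint64_t mask = ~(uint64_t)(CHUNK_SIZE - 1);
	uint64_t end = *start + *len;
	uint64_t first = *start & mask;			/* round down */
	uint64_t last = (end + CHUNK_SIZE - 1) & mask;	/* round up */

	*start = first;
	*len = last - first;
}
```

Two reads that fall in the same chunk expand to the same range, which is
where the read/read races mentioned above would come from.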

Cache AFS directories - these are just files and are currently downloaded and
parsed locally for readdir and lookup.

Cache directories from other filesystems.

Cache inode metadata, xattrs.

Add support for fallocate().

Implement content crypto in other filesystems, such as cifs which has its own
non-fscrypt way of doing this.

Support for data transport compression.

Disconnected operation.

NFS.  NFS at the very least needs to be altered to give up the use of
PG_private_2.

Local Caching
=============

There are a number of things I want to look at with local caching:

[>] Although cachefiles has switched from using bmap to using SEEK_HOLE and
SEEK_DATA, this isn't sufficient: the backing filesystem may optimise how it
stores the data, introducing both false positives and false negatives.
Cachefiles needs to track the presence/absence of data for itself.
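
For reference, this is roughly what a SEEK_DATA/SEEK_HOLE probe looks like
from userspace.  It is a sketch only: range_reported_present() is a
hypothetical name and this is not the cachefiles code, but the lseek(2)
whence values are the real Linux ones.  The result is whatever the backing
filesystem chooses to report, which is exactly the limitation described
above:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for SEEK_DATA and SEEK_HOLE in glibc */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux UAPI values, used only if the libc headers did not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Return 1 if the filesystem reports data covering all of
 * [pos, pos + len), 0 if it reports a hole (or EOF) within that range and
 * -1 on error.  Note that this moves the file position of fd.
 */
int range_reported_present(int fd, off_t pos, off_t len)
{
	off_t data, hole;

	data = lseek(fd, pos, SEEK_DATA);
	if (data < 0)
		return errno == ENXIO ? 0 : -1;	/* ENXIO: no data at/after pos */
	if (data != pos)
		return 0;			/* range starts in a hole */

	hole = lseek(fd, pos, SEEK_HOLE);
	if (hole < 0)
		return -1;
	return hole >= pos + len;
}
```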

I had a partially-implemented solution that stores a block bitmap in an xattr,
but that only worked up to files of 1G in size (with bits representing 256K
blocks in a 512-byte bitmap).
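
The 1G limit falls out of the arithmetic: 512 bytes is 4096 bits, and 4096
blocks of 256K cover exactly 1GiB.  A minimal sketch of such a bitmap
follows; the struct and function names are hypothetical and this is not the
partially-implemented code referred to above:

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SHIFT	18				/* 256KiB blocks */
#define BITMAP_BYTES	512				/* bitmap size in the xattr */
#define BITMAP_BITS	(BITMAP_BYTES * 8)		/* 4096 blocks */
#define MAX_COVERED	((uint64_t)BITMAP_BITS << BLOCK_SHIFT)	/* 1GiB */

struct block_bitmap {
	uint8_t bits[BITMAP_BYTES];
};

/* Mark the block containing file position pos as present in the cache.
 * Returns false if pos lies beyond what the bitmap can describe. */
bool bitmap_mark_present(struct block_bitmap *bm, uint64_t pos)
{
	uint64_t block = pos >> BLOCK_SHIFT;

	if (block >= BITMAP_BITS)
		return false;
	bm->bits[block / 8] |= (uint8_t)(1u << (block % 8));
	return true;
}

/* Test whether the block containing file position pos is marked present. */
bool bitmap_is_present(const struct block_bitmap *bm, uint64_t pos)
{
	uint64_t block = pos >> BLOCK_SHIFT;

	if (block >= BITMAP_BITS)
		return false;
	return (bm->bits[block / 8] >> (block % 8)) & 1u;
}
```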

[>] An alternative cache format might prove more fruitful.  Various AFS
implementations use a 'tagged cache' format with an index file and a bunch of
small files each of which contains a single block (typically 256K in OpenAFS).

This would offer some advantages over the current approach:

 - it can handle entry reuse within the index
 - doesn't require an external culling process
 - doesn't need to truncate/reallocate when invalidating

There are some downsides, including:

 - each block is in a separate file
 - metadata coherency is more tricky - a powercut may require a cache wipe
 - the index key is highly variable in size if used for multiple filesystems

But OpenAFS has been using this for something like 30 years, so it's probably
worth a try.
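
As a rough model of the tagged-cache idea, the sketch below keeps a small
in-memory index in which each slot stands for one block-sized backing file.
Everything in it is an assumption made for illustration (the struct layout,
the names, the fixed-size key and the least-recently-used reuse policy); it
is not the OpenAFS on-disk format:

```c
#include <stdint.h>

#define NR_SLOTS 4		/* tiny, for illustration only */

struct tag {
	uint64_t object;	/* hypothetical object identifier */
	uint32_t block;		/* index of the 256K block in the object */
	uint32_t in_use;
	uint64_t last_used;	/* logical clock value at last access */
};

struct tagged_index {
	struct tag slots[NR_SLOTS];
	uint64_t clock;
};

/* Find the slot holding (object, block); returns the slot or -1. */
int index_lookup(struct tagged_index *idx, uint64_t object, uint32_t block)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		struct tag *t = &idx->slots[i];

		if (t->in_use && t->object == object && t->block == block) {
			t->last_used = ++idx->clock;
			return i;
		}
	}
	return -1;
}

/* Claim a slot for (object, block): take a free slot if there is one,
 * otherwise reuse the least recently used entry in place. */
int index_claim(struct tagged_index *idx, uint64_t object, uint32_t block)
{
	int victim = 0;

	for (int i = 0; i < NR_SLOTS; i++) {
		if (!idx->slots[i].in_use) {
			victim = i;
			break;
		}
		if (idx->slots[i].last_used < idx->slots[victim].last_used)
			victim = i;
	}
	idx->slots[victim] = (struct tag){ object, block, 1, ++idx->clock };
	return victim;
}
```

Reusing an entry just retags the slot, which is why this layout needs
neither an external culling process nor a truncate/reallocate step when
something is invalidated.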

[>] Need to work out some way to store xattrs, directory entries and inode
metadata efficiently.

[>] Using NVRAM as the cache rather than spinning rust.

[>] Support for disconnected operation to pin desirable data and keep
track of changes.

[>] A user API by which the cache for specific files or volumes can be
flushed.

Disconnected Operation
======================

I'm working towards providing support for disconnected operation, so that,
provided you've got your working set pinned in the cache, you can continue to
work on your network-provided files when the network goes away and resync the
changes later.

This is going to require a number of things:

 (1) A user API by which files can be preloaded into the cache and pinned.

 (2) The ability to track changes in the cache.

 (3) A way to synchronise changes on reconnection.

 (4) A way to communicate to the user when there's a conflict with a third
     party change on reconnect.  This might involve communicating via systemd
     to the desktop environment to ask the user to indicate how they'd like
     conflicts resolved.

 (5) A way to prompt the user to re-enter their authentication/crypto keys.

 (6) A way to ask the user how to handle a process that wants to access data
     we don't have (error/wait) - and how to handle the DE getting stuck in
     this fashion.

David