Modifying and fixing(?) the per-inode snap handling in ceph

David Howells <dhowells@xxxxxxxxxx> · Mon, 15 Jan 2024 14:07:18 +0000

Hi, Ilya, Xiubo, Greg,

I'm trying to finish my patches to make ceph work with netfslib and I'm
wondering if snap handling on inodes can be made easier to work with.  Also, I
think there may be a bug in the interaction between ceph_queue_cap_snap() and
writable mmaps.

What I would like to do is to make page/folio->private point at the
ceph_cap_snap struct instead of pointing to ceph_snap_context.  This makes it
easier to fish the metadata details out in ceph when netfslib asks it to
perform a write operation.

Netfslib has the capability to pass a netfs_group struct through the API, and
I currently have this subclassed by ceph_snap_context, but that doesn't
directly carry sufficient information as I presume that's a global thing and
not an inode-specific thing.
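
For reference, the subclassing mentioned above is just the usual embed-and-
container_of() pattern.  A minimal userspace sketch follows; the struct
layouts are trimmed stand-ins rather than the real kernel definitions, and
group_to_snapc() is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for netfs_group; the real struct's layout is not reproduced. */
struct netfs_group {
	int ref;
};

/* Trimmed stand-in for ceph_snap_context with the group embedded in it. */
struct ceph_snap_context {
	struct netfs_group group;	/* base "class" handed to netfslib */
	uint64_t seq;
};

/* Hypothetical helper: recover the subclass from the group netfslib holds. */
static struct ceph_snap_context *group_to_snapc(struct netfs_group *group)
{
	assert(group != NULL);
	return container_of(group, struct ceph_snap_context, group);
}
```

Nothing reachable from the group in this arrangement is specific to one
inode, which is the limitation described above.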

However, it looks like capsnaps don't always exist, even on dirty inodes...

So what I'm thinking is:

 (1) Make struct ceph_cap_snap a subclass of netfs_group.  This would allow
     netfslib to manipulate them and attach them to dirty pages and do
     selective writeback.

 (2) Always keep a ceph_cap_snap on a dirty inode.  It can be treated
     specially when it's the only snap and at the head.

 (3) Offload some of the fields from ceph_inode_info into ceph_cap_snap
     (eg. truncate_size and truncate_seq) and update them directly there.

 (4) On entry to any sort of write routine, see if we need a new capsnap for
     that inode and, if so, create one.  This would include ->write_iter(),
     ->page_mkwrite(), ->setattr() and possibly ->setxattr().

 (5) In queue_realm_cap_snaps(), mark the capsnap as being obsolete and call
     unmap_mapping_pages() on each inode to force ->page_mkwrite() to be
     called[!] on further modification.

     queue_realm_cap_snaps() doesn't then need to create a new capsnap; this
     can be left to the various write routines.

     [!] This would fix the aforementioned potential bug whereby someone can
     continue writing to the inode even though a new snap has happened.

 (6) ceph_writepages() calls netfs_writepages_group() to flush out pages with
     the matching group, stepping through the capsnap list on the inode.
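
To make the intended flow of (4) to (6) a bit more concrete, here's a rough
userspace model.  It is only a sketch: struct capsnap, struct inode_model and
all three function names are hypothetical, locking and refcounting are
omitted, and "flushing" just counts pages rather than calling into netfslib.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily trimmed model of a per-inode capsnap. */
struct capsnap {
	struct capsnap *next;	/* list is kept oldest first */
	bool obsolete;		/* a newer snapshot has been taken */
	int dirty_pages;	/* pages currently attached to this capsnap */
};

struct inode_model {
	struct capsnap *oldest;
	struct capsnap *newest;
};

/* (4): on entry to a write path, reuse the head capsnap or make a new one. */
static struct capsnap *get_write_capsnap(struct inode_model *inode)
{
	struct capsnap *cs;

	assert(inode != NULL);
	cs = inode->newest;
	if (cs && !cs->obsolete)
		return cs;

	cs = calloc(1, sizeof(*cs));
	if (!cs)
		return NULL;
	if (inode->newest)
		inode->newest->next = cs;
	else
		inode->oldest = cs;
	inode->newest = cs;
	return cs;
}

/* (5): a snapshot happened; only mark the head, don't allocate anything. */
static void mark_snapshot(struct inode_model *inode)
{
	if (inode->newest)
		inode->newest->obsolete = true;
	/* The proposal would also unmap the inode's pages at this point so
	 * that the next store through an mmap goes via ->page_mkwrite(). */
}

/* (6): step through the list oldest first, flushing one group at a time.
 * Returns the number of pages "written". */
static int writeback_all(struct inode_model *inode)
{
	int flushed = 0;

	while (inode->oldest) {
		struct capsnap *cs = inode->oldest;

		flushed += cs->dirty_pages;
		inode->oldest = cs->next;
		free(cs);
	}
	inode->newest = NULL;
	return flushed;
}
```

The point of the model is that the snapshot path never allocates: the next
writer notices the obsolete head and creates its successor.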

Any thoughts on whether this would work?  If I can do this, I can reduce
get_oldest_context() to almost nothing and don't need the ceph_writeback_ctl
struct anymore (I think).

Thanks,
David