Re: Ceph and Netfslib

Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> · Wed, 18 Dec 2024 19:06:56 +0000

Hi David,

On Wed, 2024-12-18 at 18:33 +0000, David Howells wrote:
> Hi Alex, Slava,
> 
> I don't know whether you know, but I'm working on netfslib-ising ceph
> with an
> eye to moving all the VFS/VM normal I/O interfaces to netfslib (-
> >read_iter,
> ->write_iter, ->readahead, ->read_folio, ->page_mkwrite, -
> >writepages), though
> with wrapping/methods by which each network filesystem can add its
> own
> distinctive flavour.
> 
> Also, that would include doing things like content encryption, since
> that is
> generally useful in filesystems and I have plans to support it in
> both AFS and
> CIFS as well.
> 
> This means that fs/ceph/ will have practically nothing to do with
> page structs
> or folio structs.  All that will be offloaded to netfslib and
> netfslib will
> just hand iov_iters to the client filesystems, including ceph.
> 
> This will also allow me to massively simplify the networking code in
> net/ceph/.  My aim is to replace all the page array, page lists, bio,
> etc. data types in libceph with a single type that just conveys an
> iov_iter
> and I have a ceph_databuf type that holds a list of pages in the form
> of a
> bio_vec[] and I can extract an iov_iter from that to pass to the
> networking.
> 

Sounds like a really good idea to me. Makes a lot of sense.

> Then, for the transmission side, the iov_iter will be passed to the
> TCP socket
> with MSG_SPLICE_PAGES rather than iterating over the data type and
> passing a
> page fragment at a time.  We fixed this up for nfsd and Chuck Lever
> reported a
> improvement in throughput (15% if I remember correctly).
> 
> The patches I have so far can be found here:
> 
> 	
> https://git.kernel.org/p 
> ub_scm_linux_kernel_git_dhowells_linux-2Dfs.git_log_-3Fh-3Dceph-
> 2Diter&d=DwIFAg&c=BSDicqBQBDjDI9RkVyTcHQ&r=q5bIm4AXMzc8NJu1_RGmnQ2fMW
> Kq4Y4RAkElvUgSs00&m=IdnhHmTDiQZNzP_zbJHD5PQFfO3U8UaEuGpDubyf8fFXBu4KQ
> 7NFE-0OklCCoqtp&s=v_tEim-OriGJ7-Mwdc9jHMW6Aj_7RKr5ZwGwjg5gfy8&e= 
> 
> Note that I have rbd working with the changes I've made to that
> point.
> 
> 
> Anyway, ... I need to pick someone's brain about whether the way per-
> page
> tracking of snapshots within fs/ceph/ can be simplified.
> 
> Firstly, note that there may be a bug in ceph writeback cleanup as it
> stands.
> It calls folio_detach_private() without holding the folio lock (it
> holds the
> writeback lock, but that's not sufficient by MM rules).  This means
> you have a
> race between { setting ->private, setting PG_private and inc refcount
> } on one
> hand and { clearing ->private, clearing PG_private and dec refcount }
> on the
> other.
> 

I assume you imply ceph_invalidate_folio() method. Am I correct here?

> Unfortunately, you cannot just take the page lock from writeback
> cleanup
> without running the risk of deadlocking against ->writepages()
> wanting to take
> PG_lock and then PG_writeback.  And you cannot drop PG_writeback
> first as the
> moment you do that, the page can be deallocated.
> 
> 
> Secondly, there's a counter, ci->i_wrbuffer_ref, that might actually
> be
> redundant if we do it right as I_PINNING_NETFS_WB offers an
> alternative way we
> might do things.  If we set this bit, ->write_inode() will be called
> with
> wbc->unpinned_netfs_wb set when all currently dirty pages have been
> cleaned up
> (see netfs_unpin_writeback()).  netfslib currently uses this to pin
> the
> fscache objects but it could perhaps also be used to pin the
> writeback cap for
> ceph.
> 

Yeah, ci->i_wrbuffer_ref looks like not very reliable programming
pattern and if we can do it in other way, then it could be more safe
solution. However, this counter is used in multiple places of ceph
code. It needs to find a solution to get rid of this counter in safe
and easy way.

> 
> Thirdly, I was under the impression that, for any given page/folio,
> only the
> head snapshot could be altered - and that any older snapshot must be
> flushed
> before we could allow that.
> 
> 
> Fourthly, the ceph_snap_context struct holds a list of snaps.  Does
> it really
> need to, or is just the most recent snap for which the folio holds
> changes
> sufficient?
> 

Let me dive into the implementation details. Maybe, Alex can share more
details here.

Thanks,
Slava.