Re: Ceph and Netfslib

David Howells <dhowells@xxxxxxxxxx> · Wed, 18 Dec 2024 19:48:06 +0000

Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:

> > Firstly, note that there may be a bug in ceph writeback cleanup as it
> > stands.
> > It calls folio_detach_private() without holding the folio lock (it
> > holds the
> > writeback lock, but that's not sufficient by MM rules).  This means
> > you have a
> > race between { setting ->private, setting PG_private and inc refcount
> > } on one
> > hand and { clearing ->private, clearing PG_private and dec refcount }
> > on the
> > other.
> > 
> 
> I assume you imply ceph_invalidate_folio() method. Am I correct here?

Actually, no, writepages_finish() is the culprit.

ceph_invalidate_folio() is called with the folio locked and can freely wrangle
folio->private.

> > Secondly, there's a counter, ci->i_wrbuffer_ref, that might actually be
> > redundant if we do it right as I_PINNING_NETFS_WB offers an alternative
> > way we might do things.  If we set this bit, ->write_inode() will be
> > called with wbc->unpinned_netfs_wb set when all currently dirty pages have
> > been cleaned up (see netfs_unpin_writeback()).  netfslib currently uses
> > this to pin the fscache objects but it could perhaps also be used to pin
> > the writeback cap for ceph.
> 
> Yeah, ci->i_wrbuffer_ref looks like not very reliable programming
> pattern and if we can do it in other way, then it could be more safe
> solution. However, this counter is used in multiple places of ceph
> code. It needs to find a solution to get rid of this counter in safe
> and easy way.
> 
> > 
> > Thirdly, I was under the impression that, for any given page/folio,
> > only the
> > head snapshot could be altered - and that any older snapshot must be
> > flushed
> > before we could allow that.
> > 
> > 
> > Fourthly, the ceph_snap_context struct holds a list of snaps.  Does
> > it really
> > need to, or is just the most recent snap for which the folio holds
> > changes
> > sufficient?
> > 
> 
> Let me dive into the implementation details. Maybe, Alex can share more
> details here.

Thanks.

David