On Tue, Nov 28, 2023 at 08:58:31AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 28, 2023 at 04:38:08PM +0100, Christoph Hellwig wrote:
> > Hi Darrick,
> > 
> > while reviewing your online repair series I've noticed that the new
> > xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting
> > for the buffer to go off a delwri list, and that reminded me of an
> > assert I saw during my allocator experiments, where
> > xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't
> > have XBF_DONE set because it comes from an ifree transaction (see
> > my current not fully thought out bandaid below).
> 
> LOL, guess what I was looking at yesterday too! :)
> 
> > The way we currently set and check XBF_DONE seems a bit undefined.  The
> > one clear use case is that read uses it to see if a buffer was read in.
> > But places that use buf_get and manually fill in data only use it in a
> > few cases.  Do we need to define clear semantics for it?  Or maybe
> > replace it with an XBF_READ_DONE flag for that main read use case and
> > then think about what to do with the rest?
> 
> I thought XBF_DONE meant "contents have been read in from disk and
> have passed/will pass verifiers"

Effectively.

> > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > index 8e886ecfd69a3b..575922c64d4d3a 100644
> > --- a/fs/xfs/xfs_trans_buf.c
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -253,7 +253,6 @@ xfs_trans_read_buf_map(
> >  		ASSERT(bp->b_transp == tp);
> >  		ASSERT(bp->b_log_item != NULL);
> >  		ASSERT(!bp->b_error);
> > -		ASSERT(bp->b_flags & XBF_DONE);
> 
> I don't think this is the right thing to do here -- if the buffer is
> attached to a transaction, it ought to be XBF_DONE.  I think every
> transaction that calls _get_buf and rewrites the buffer contents will
> set XBF_DONE via xfs_trans_dirty_buf, right?

It should, yes. Otherwise the initialisation data in the buffer has
not been logged and may result in writeback and/or recovery being
incorrectly ordered w.r.t.
the transaction that initialised the buffer contents.

> Hmm.  Maybe I'm wrong -- a transaction could bjoin a buffer and then
> call xfs_trans_read_buf_map before dirtying it.  That strikes me as a
> suspicious thing to do, though.

It also seems to me that it doesn't solve the problem of buffer
invalidation followed by a read operation in commit context.

Having thought a bit more on this (admittedly they are feverish,
covid-inspired thoughts), I suspect the real problem here is
requiring xfs_imap_to_bp() to ensure we can pin the cluster buffer
in memory whilst there are dirty inodes over it.

If we go back to the days of Irix and the early days of the Linux
port, we always pinned the inode cluster buffer in memory whilst we
had an inode active in cache via the inode cluster hash index. We
could do lookups directly on in-memory inode cluster buffers rather
than on inodes. Inodes held references on the cluster buffer, and
when the last reference to a cluster buffer went away, it was
removed from the inode cluster hash. i.e. we never had the inode
cluster buffer RMW problem where cached inodes got dirtied without a
cluster buffer already being in memory.

It was simple, and the only downside was that it didn't scale. Hence
we got rid of the inode cluster hash index and the pinning of inode
cluster buffers with the introduction of the RCU protected inode
cache based on radix trees. The radix trees were so much more
efficient than a fixed-size cluster hash that we simply did away
with the cluster hash, and got a memory footprint improvement for
read-mostly inode traversal workloads for free.

Perhaps it is time to revisit that ~15 year old architectural
choice? We've reinstated the pinning for dirty inodes, so perhaps we
should just do it unconditionally for all inodes now and reinstate
the direct inode -> cluster buffer relationship we once had.

This has other benefits as well.
We can "densify" the XFS clean inode cache by making VFS inodes
unconditionally I_DONT_CACHE, and simply rely on the XFS buffer
cache LRU to hold recently used inode buffers. It solves several
nasty corner cases around inode cluster buffer freeing and
XFS_ISTALE. It allows us to avoid cluster buffer lookups in the
inode logging path. It opens the gate to shared access to the buffer
for flushing dirty inodes to the cluster buffer without adding new
lifecycle problems. It allows significant efficiency optimisations
in managing inode items in the AIL, because lifecycle discrepancies
between cluster buffers and inodes go away. And so on.

So perhaps the best fix here is to reinstate the old behaviour of
inodes always pinning the inode cluster buffer in memory, and hence
elide the need for on-demand, log item/IO completion time item
tracking altogether....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx