Re: XBF_DONE semantics

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 29 Nov 2023 08:08:37 +1100

On Tue, Nov 28, 2023 at 04:38:08PM +0100, Christoph Hellwig wrote:
> Hi Darrick,
> 
> while reviewing your online repair series I've noticed that the new
> xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting
> for the buffer to go off a delwri list, and that reminded me off an
> assert I saw during my allocator experiments, where
> xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't
> have XBF_DONE set because it comes from an ifree transaction (see
> my current not fully thought out bandaid below).

I'll come back to the bug later, because I know what it is and I
just haven't had time to fix it yet. I'll address XBF_DONE first.

XBF_DONE means the data in the buffer is valid. It's equivalent to
the uptodate bit in a folio. It has no other meaning.

> The way we currently set and check XBF_DONE seems a bit undefined.  The
> one clear use case is that read uses it to see if a buffer was read in.

Yes.

> But places that use buf_get and manually fill in data only use it in a
> few cases. 

Yes. the caller of buf_get always needs to set XBF_DONE if it is
initialising a new buffer ready for it to be written. It should be
done before the caller drops the buffer lock so that no other lookup
can see the buffer in the state of "contains valid data but does not
have XBF_DONE set".

Also, there are cases where we use buf_get but we don't care about
the contents being initialised because we are invalidating
the buffer and need the buffer+buf_log_item to log the invalidation
to the journal.

In these cases we just don't care that the contents
are valid, because xfs_trans_binval() calls xfs_buf_stale() to
invalidate the contents and that removes the XBF_DONE flag. We do
this in places like inode chunk removal to invalidate the cluster
buffers to ensure they are written back after then chunk has been
freed.

.... and this brings us to the bug that you mentioned about an ifree
transaction leaving an inode cluster buffer in cache without
XBF_DONE set....

The issue is xfs_inode_item_precommit() attaching inodes to the
cluster buffer. In the old days before we delayed inode logging to
the end of the ifree transaction, the order was:

xfs_ifree
  xfs_difree(ip)
    xfs_imap_to_bp()
      xfs_trans_buf_read()
    xfs_trans_brelse()
  xfs_trans_log_inode(ip)
  xfs_ifree_cluster(ip)
    for each incore inode {
      set XFS_ISTALE
    }
    for each cluster buffer {
      xfs_trans_buf_get()
      xfs_trans_binval()
    }
  xfs_trans_commit()

IOWs, the attachment of the inode to the cluster buffer in
xfs_trans_log_inode() occurred before both the inode was marked
XFS_ISTALE and the cluster buffer was marked XBF_STALE and XBF_DONE
was removed from it. Hence the lookup in xfs_difree() always found a
valid XBF_DONE buffer.

With the fixes for AGF->AGI->inode cluster buffer locking order done
a while back, we moved the processing that was done in
xfs_trans_log_inode() to xfs_inode_item_precommit(), which is called
from xfs_trans_commit(). This moved the xfs_imap_to_bp() call when
logging th einode from before the cluster invalidation to after it.

The result is that imap_to_bp() now finds the inode cluster buffer
in memory (as it should), but it has been marked stale (correctly!)
and xfs_trans_buf_read_map() freaks out over that (again,
correctly!).

The key to triggering this situation is that the inode cluster
buffer needs to be written back between the unlink() syscall and the
inactivation processing that frees the cluster buffer. Inode cluster
buffer IO completion removes the inodes from the cluster buffer, so
when they are next dirtied they need to be re-added. If this inode
cluster buffer writeback coincides with the transaction removing of
the last inode from an inode chunk and hence freeing the inode
chunk, we end up with this stiuation occurring and assert failures
in xfs_trans_read_buf_map().

So, like I said: I know what the bug is, I worked it out from the
one time one of my test machines tripped over it about 4 weeks ago,
but I just haven't had the time since then to work out a fix.

I suspect that we can check XFS_ISTALE in xfs_inode_item_precommit()
and do something different, but I'd much prefer that the inode still
gets added to the inode cluster buffer and cleaned up with all the
other XFS_ISTALE indoes on the cluster buffer at journal commit
completion time. Maybe we can pass a new flag to xfs_imap_to_bp() to
say "stale buffer ok here" or something similar, because I really
want the general case of xfs_trans_buf_read_map() to fail loudly if
a buffer without XBF_DONE is returned....

> Do we need to define clear semantics for it?  Or maybe
> replace with an XBF_READ_DONE flag for that main read use case and
> then think what do do with the rest?

To me, the semantics of XBF_DONE are pretty clear. Apart from fixing
the bug you are seeing, I'm not sure that anything really needs to
change....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx