Re: [PATCH 7/9] xfs: ignore stale buffers when scanning the buffer cache

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 10 Jul 2023 09:15:06 +1000

On Wed, Jul 05, 2023 at 04:17:36PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 20, 2023 at 04:01:13PM +1000, Dave Chinner wrote:
> > On Mon, Jun 19, 2023 at 09:44:43PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jun 20, 2023 at 01:24:10PM +1000, Dave Chinner wrote:
> > > > On Thu, May 25, 2023 at 05:44:48PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > > 
> > > > > After an online repair, we need to invalidate buffers representing the
> > > > > blocks from the old metadata that we're replacing.  It's possible that
> > > > > parts of a tree that were previously cached in memory are no longer
> > > > > accessible due to media failure or other corruption on interior nodes,
> > > > > so repair figures out the old blocks from the reverse mapping data and
> > > > > scans the buffer cache directly.
> > > > > 
> > > > > Unfortunately, the current buffer cache code triggers asserts if the
> > > > > rhashtable lookup finds a non-stale buffer of a different length than
> > > > > the key we searched for.  For regular operation this is desirable, but
> > > > > for this repair procedure, we don't care since we're going to forcibly
> > > > > stale the buffer anyway.  Add an internal lookup flag to avoid the
> > > > > assert.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/xfs/scrub/reap.c |    2 +-
> > > > >  fs/xfs/xfs_buf.c    |    5 ++++-
> > > > >  fs/xfs/xfs_buf.h    |   10 ++++++++++
> > > > >  3 files changed, 15 insertions(+), 2 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
> > > > > index 30e61315feb0..ca75c22481d2 100644
> > > > > --- a/fs/xfs/scrub/reap.c
> > > > > +++ b/fs/xfs/scrub/reap.c
> > > > > @@ -149,7 +149,7 @@ xrep_block_reap_binval(
> > > > >  	 */
> > > > >  	error = xfs_buf_incore(sc->mp->m_ddev_targp,
> > > > >  			XFS_FSB_TO_DADDR(sc->mp, fsbno),
> > > > > -			XFS_FSB_TO_BB(sc->mp, 1), 0, &bp);
> > > > > +			XFS_FSB_TO_BB(sc->mp, 1), XBF_BCACHE_SCAN, &bp);
> > > > 
> > > > Can't say I'm a big fan of XBF_BCACHE_SCAN as a name - it tells me
> > > > nothing about what the incore lookup is actually doing. The actual
> > > > lookup action that is being performed is "find any match" rather
> > > > than "find exact match". XBF_ANY_MATCH would be a better name, IMO.
> > > > 
> > > > >  	if (error)
> > > > >  		return;
> > > > >  
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index 15d1e5a7c2d3..b31e6d09a056 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -481,7 +481,8 @@ _xfs_buf_obj_cmp(
> > > > >  		 * reallocating a busy extent. Skip this buffer and
> > > > >  		 * continue searching for an exact match.
> > > > >  		 */
> > > > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > > > +		if (!(map->bm_flags & XBM_IGNORE_LENGTH_MISMATCH))
> > > > > +			ASSERT(bp->b_flags & XBF_STALE);
> > > > 
> > > > And this becomes XBM_ANY_MATCH, too.
> > > 
> > > Hmmm.  I've never come up with a good name for this flag.  The caller
> > > actually has a *specific* length in mind; it simply doesn't want to trip
> > > the assertions on the cached buffers that have a different length but
> > > won't be returned by *this* call.
> > > 
> > > If the buffer cache has bufs for daddr 24 len 8 and daddr len 120, the
> > > scan calls xfs_buf_get as follows:
> > > 
> > > daddr 24 len 1 (nothing)
> > > daddr 24 len 2 (nothing)
> > > ...
> > > daddr 24 len 8 (finds the first buffer)
> > > ...
> > > daddr 24 len 120 (finds the second buffer)
> > > ...
> > > daddr 24 len 132 (nothing)
> > > 
> > > I don't want the scan to ASSERT 130 times, because that muddles the
> > > output so badly that it becomes impossible to find relevant debugging
> > > messages among the crap.
> > 
> > As I mentioned in the my response to the next patch, this is an
> > O(N^2) brute force search. But how do you get two buffers at the
> > same address into the cache in the first place?
> 
> /me smacks forehead, realizes that I totally lead you astray here.
> What we're scanning for is the case that the buffer cache has two
> overlapping buffers with *different* daddrs.
> 
> (That teaches me to reply to emails while on vacation...)
> 
> xreap_agextent_binval is only called if the extent being freed is
> completely owned by the data structure and is *not* crosslinked with a
> different structure.  We've just rebuilt a data structure that was
> corrupt in some manner, but the reaper doesn't know the details of that
> corruption.  Therefore, the reaper should invalidate /all/ buffers that
> might span the extent being freed, no matter how they ended up in the
> cache.  If a deceased data structure thought it was the sole owner of a
> single fsblock of space starting at fsblock 16, we ought to look for
> buffers (daddr 128, length 1) (128, 2), ... (128, 8), (129, 1), (129, 2)
> ...  (129, 7) ... (135, 1) and invalidate all of them.

Ok, but we can't have random metadata buffers at individual daddr
offsets - except for the AG headers, every metadata buffer is
filesystem block aligned. Hence in the above example, there is no
possibility of having a metadata buffer at sectors 129-135 on a 4kB
block size filesystem.

> Concrete example:
> 
> So let's pretend that we have an xfs with fs blocksize 4k and dir
> blocksize 8k.  Let's suppose the directory's data fork maps fsblock 16,
> resulting in a buffer (daddr 128, length 16).  Let us further suppose
> the inobt then allocates fsblock 17, resulting in a buffer (daddr 136,
> length 8).  These are crosslinked.

RIght, that's an intersection of {128,16} and {136,8}. The search
region for buffer invalidation is {128, 8} {128, 16} and {136, 8}.
They are the only possible buffers that can be cached in that region
for a 4kB block size filesystem, as there can be no metadata buffers
starting at daddrs 129-135 or 137-143...

> First, we discover the inobt is crosslinked with a directory and decide
> to rebuild it.  We write out the new inobt and go to reap the old
> blocks.  Since fsblock 17 is crosslinked, we simply remove the OWN_INOBT
> rmap record and leave the buffer.
> 
> Aside: Long ago, I tried to make the reaping code invalidate buffers
> when the space is crosslinked, but I couldn't figure out how to deal
> with the situation where (say) the bmap btrees of two separate forks
> think they own the same block.  The b_ops will be the same; the buffer
> cache doesn't know about the owner field in the block header, and
> there's no way to distinguish the blocks for a data fork bmbt vs. an
> attr fork bmbt.
> 
> Next we discover that the directory is corrupt and decide to rebuild
> that.  The directory is now the only owner, so it can actually free the
> two fsb of space at startblock 16fsb.  Both buffers (128, 16) and (136,
> 8) are still in the cache, so it needs to invalidate both.
> 
> Does all /that/ make sense?

Yes, it does, and I figured that is waht you were trying to detect,
it's just the search pattern you described made no sense.

I still think needs cleaning up - it's doing a massive amount of
unnecessary work checking for things that cannot exist...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx