On Wed, Jul 05, 2023 at 04:17:36PM -0700, Darrick J. Wong wrote: > On Tue, Jun 20, 2023 at 04:01:13PM +1000, Dave Chinner wrote: > > On Mon, Jun 19, 2023 at 09:44:43PM -0700, Darrick J. Wong wrote: > > > On Tue, Jun 20, 2023 at 01:24:10PM +1000, Dave Chinner wrote: > > > > On Thu, May 25, 2023 at 05:44:48PM -0700, Darrick J. Wong wrote: > > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > > > > > After an online repair, we need to invalidate buffers representing the > > > > > blocks from the old metadata that we're replacing. It's possible that > > > > > parts of a tree that were previously cached in memory are no longer > > > > > accessible due to media failure or other corruption on interior nodes, > > > > > so repair figures out the old blocks from the reverse mapping data and > > > > > scans the buffer cache directly. > > > > > > > > > > Unfortunately, the current buffer cache code triggers asserts if the > > > > > rhashtable lookup finds a non-stale buffer of a different length than > > > > > the key we searched for. For regular operation this is desirable, but > > > > > for this repair procedure, we don't care since we're going to forcibly > > > > > stale the buffer anyway. Add an internal lookup flag to avoid the > > > > > assert. > > > > > > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > --- > > > > > fs/xfs/scrub/reap.c | 2 +- > > > > > fs/xfs/xfs_buf.c | 5 ++++- > > > > > fs/xfs/xfs_buf.h | 10 ++++++++++ > > > > > 3 files changed, 15 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > > diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c > > > > > index 30e61315feb0..ca75c22481d2 100644 > > > > > --- a/fs/xfs/scrub/reap.c > > > > > +++ b/fs/xfs/scrub/reap.c > > > > > @@ -149,7 +149,7 @@ xrep_block_reap_binval( > > > > > */ > > > > > error = xfs_buf_incore(sc->mp->m_ddev_targp, > > > > > XFS_FSB_TO_DADDR(sc->mp, fsbno), > > > > > - XFS_FSB_TO_BB(sc->mp, 1), 0, &bp); > > > > > + XFS_FSB_TO_BB(sc->mp, 1), XBF_BCACHE_SCAN, &bp); > > > > > > > > Can't say I'm a big fan of XBF_BCACHE_SCAN as a name - it tells me > > > > nothing about what the incore lookup is actually doing. The actual > > > > lookup action that is being performed is "find any match" rather > > > > than "find exact match". XBF_ANY_MATCH would be a better name, IMO. > > > > > > > > > if (error) > > > > > return; > > > > > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > > > index 15d1e5a7c2d3..b31e6d09a056 100644 > > > > > --- a/fs/xfs/xfs_buf.c > > > > > +++ b/fs/xfs/xfs_buf.c > > > > > @@ -481,7 +481,8 @@ _xfs_buf_obj_cmp( > > > > > * reallocating a busy extent. Skip this buffer and > > > > > * continue searching for an exact match. > > > > > */ > > > > > - ASSERT(bp->b_flags & XBF_STALE); > > > > > + if (!(map->bm_flags & XBM_IGNORE_LENGTH_MISMATCH)) > > > > > + ASSERT(bp->b_flags & XBF_STALE); > > > > > > > > And this becomes XBM_ANY_MATCH, too. > > > > > > Hmmm. I've never come up with a good name for this flag. The caller > > > actually has a *specific* length in mind; it simply doesn't want to trip > > > the assertions on the cached buffers that have a different length but > > > won't be returned by *this* call. > > > > > > If the buffer cache has bufs for daddr 24 len 8 and daddr len 120, the > > > scan calls xfs_buf_get as follows: > > > > > > daddr 24 len 1 (nothing) > > > daddr 24 len 2 (nothing) > > > ... > > > daddr 24 len 8 (finds the first buffer) > > > ... > > > daddr 24 len 120 (finds the second buffer) > > > ... > > > daddr 24 len 132 (nothing) > > > > > > I don't want the scan to ASSERT 130 times, because that muddles the > > > output so badly that it becomes impossible to find relevant debugging > > > messages among the crap. > > > > As I mentioned in the my response to the next patch, this is an > > O(N^2) brute force search. But how do you get two buffers at the > > same address into the cache in the first place? > > /me smacks forehead, realizes that I totally lead you astray here. > What we're scanning for is the case that the buffer cache has two > overlapping buffers with *different* daddrs. > > (That teaches me to reply to emails while on vacation...) > > xreap_agextent_binval is only called if the extent being freed is > completely owned by the data structure and is *not* crosslinked with a > different structure. We've just rebuilt a data structure that was > corrupt in some manner, but the reaper doesn't know the details of that > corruption. Therefore, the reaper should invalidate /all/ buffers that > might span the extent being freed, no matter how they ended up in the > cache. If a deceased data structure thought it was the sole owner of a > single fsblock of space starting at fsblock 16, we ought to look for > buffers (daddr 128, length 1) (128, 2), ... (128, 8), (129, 1), (129, 2) > ... (129, 7) ... (135, 1) and invalidate all of them. Ok, but we can't have random metadata buffers at individual daddr offsets - except for the AG headers, every metadata buffer is filesystem block aligned. Hence in the above example, there is no possibility of having a metadata buffer at sectors 129-135 on a 4kB block size filesystem. > Concrete example: > > So let's pretend that we have an xfs with fs blocksize 4k and dir > blocksize 8k. Let's suppose the directory's data fork maps fsblock 16, > resulting in a buffer (daddr 128, length 16). Let us further suppose > the inobt then allocates fsblock 17, resulting in a buffer (daddr 136, > length 8). These are crosslinked. RIght, that's an intersection of {128,16} and {136,8}. The search region for buffer invalidation is {128, 8} {128, 16} and {136, 8}. They are the only possible buffers that can be cached in that region for a 4kB block size filesystem, as there can be no metadata buffers starting at daddrs 129-135 or 137-143... > First, we discover the inobt is crosslinked with a directory and decide > to rebuild it. We write out the new inobt and go to reap the old > blocks. Since fsblock 17 is crosslinked, we simply remove the OWN_INOBT > rmap record and leave the buffer. > > Aside: Long ago, I tried to make the reaping code invalidate buffers > when the space is crosslinked, but I couldn't figure out how to deal > with the situation where (say) the bmap btrees of two separate forks > think they own the same block. The b_ops will be the same; the buffer > cache doesn't know about the owner field in the block header, and > there's no way to distinguish the blocks for a data fork bmbt vs. an > attr fork bmbt. > > Next we discover that the directory is corrupt and decide to rebuild > that. The directory is now the only owner, so it can actually free the > two fsb of space at startblock 16fsb. Both buffers (128, 16) and (136, > 8) are still in the cache, so it needs to invalidate both. > > Does all /that/ make sense? Yes, it does, and I figured that is waht you were trying to detect, it's just the search pattern you described made no sense. I still think needs cleaning up - it's doing a massive amount of unnecessary work checking for things that cannot exist... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx