On Mon, Jul 10, 2023 at 09:15:06AM +1000, Dave Chinner wrote:
> On Wed, Jul 05, 2023 at 04:17:36PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 20, 2023 at 04:01:13PM +1000, Dave Chinner wrote:
> > > On Mon, Jun 19, 2023 at 09:44:43PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jun 20, 2023 at 01:24:10PM +1000, Dave Chinner wrote:
> > > > > On Thu, May 25, 2023 at 05:44:48PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > > > 
> > > > > > After an online repair, we need to invalidate buffers representing
> > > > > > the blocks from the old metadata that we're replacing.  It's
> > > > > > possible that parts of a tree that were previously cached in
> > > > > > memory are no longer accessible due to media failure or other
> > > > > > corruption on interior nodes, so repair figures out the old blocks
> > > > > > from the reverse mapping data and scans the buffer cache directly.
> > > > > > 
> > > > > > Unfortunately, the current buffer cache code triggers asserts if
> > > > > > the rhashtable lookup finds a non-stale buffer of a different
> > > > > > length than the key we searched for.  For regular operation this
> > > > > > is desirable, but for this repair procedure, we don't care since
> > > > > > we're going to forcibly stale the buffer anyway.  Add an internal
> > > > > > lookup flag to avoid the assert.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > > > ---
> > > > > >  fs/xfs/scrub/reap.c |    2 +-
> > > > > >  fs/xfs/xfs_buf.c    |    5 ++++-
> > > > > >  fs/xfs/xfs_buf.h    |   10 ++++++++++
> > > > > >  3 files changed, 15 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
> > > > > > index 30e61315feb0..ca75c22481d2 100644
> > > > > > --- a/fs/xfs/scrub/reap.c
> > > > > > +++ b/fs/xfs/scrub/reap.c
> > > > > > @@ -149,7 +149,7 @@ xrep_block_reap_binval(
> > > > > >  	 */
> > > > > >  	error = xfs_buf_incore(sc->mp->m_ddev_targp,
> > > > > >  			XFS_FSB_TO_DADDR(sc->mp, fsbno),
> > > > > > -			XFS_FSB_TO_BB(sc->mp, 1), 0, &bp);
> > > > > > +			XFS_FSB_TO_BB(sc->mp, 1), XBF_BCACHE_SCAN, &bp);
> > > > > 
> > > > > Can't say I'm a big fan of XBF_BCACHE_SCAN as a name - it tells me
> > > > > nothing about what the incore lookup is actually doing.  The actual
> > > > > lookup action that is being performed is "find any match" rather
> > > > > than "find exact match".  XBF_ANY_MATCH would be a better name, IMO.
> > > > > 
> > > > > >  	if (error)
> > > > > >  		return;
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > > index 15d1e5a7c2d3..b31e6d09a056 100644
> > > > > > --- a/fs/xfs/xfs_buf.c
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -481,7 +481,8 @@ _xfs_buf_obj_cmp(
> > > > > >  	 * reallocating a busy extent. Skip this buffer and
> > > > > >  	 * continue searching for an exact match.
> > > > > >  	 */
> > > > > > -	ASSERT(bp->b_flags & XBF_STALE);
> > > > > > +	if (!(map->bm_flags & XBM_IGNORE_LENGTH_MISMATCH))
> > > > > > +		ASSERT(bp->b_flags & XBF_STALE);
> > > > > 
> > > > > And this becomes XBM_ANY_MATCH, too.
> > > > 
> > > > Hmmm.  I've never come up with a good name for this flag.  The
> > > > caller actually has a *specific* length in mind; it simply doesn't
> > > > want to trip the assertions on the cached buffers that have a
> > > > different length but won't be returned by *this* call.
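(To make the lookup semantics above concrete: with the flag applied,
the rhashtable compare callback works roughly like the sketch below.
This is paraphrased from memory rather than copied verbatim from
_xfs_buf_obj_cmp, so treat the exact shape as an approximation.)

static int
_xfs_buf_obj_cmp(
	struct rhashtable_compare_arg	*arg,
	const void			*obj)
{
	const struct xfs_buf_map	*map = arg->key;
	const struct xfs_buf		*bp = obj;

	if (bp->b_rhash_key != map->bm_bn)
		return 1;	/* different daddr, keep walking */

	if (unlikely(bp->b_length != map->bm_len)) {
		/*
		 * Same daddr, different length.  Normally that can only
		 * be a stale buffer from a cancelled transaction, hence
		 * the assert; a cache scan expects to encounter these
		 * and skips them quietly while it keeps searching for
		 * an exact match.
		 */
		if (!(map->bm_flags & XBM_IGNORE_LENGTH_MISMATCH))
			ASSERT(bp->b_flags & XBF_STALE);
		return 1;
	}
	return 0;	/* exact {daddr, length} match */
}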
> > > > 
> > > > If the buffer cache has bufs for daddr 24 len 8 and daddr 24 len
> > > > 120, the scan calls xfs_buf_incore as follows:
> > > > 
> > > > daddr 24 len 1   (nothing)
> > > > daddr 24 len 2   (nothing)
> > > > ...
> > > > daddr 24 len 8   (finds the first buffer)
> > > > ...
> > > > daddr 24 len 120 (finds the second buffer)
> > > > ...
> > > > daddr 24 len 132 (nothing)
> > > > 
> > > > I don't want the scan to ASSERT 130 times, because that muddles
> > > > the output so badly that it becomes impossible to find relevant
> > > > debugging messages among the crap.
> > > 
> > > As I mentioned in my response to the next patch, this is an O(N^2)
> > > brute force search.  But how do you get two buffers at the same
> > > address into the cache in the first place?
> > 
> > /me smacks forehead, realizes that I totally led you astray here.
> > What we're scanning for is the case that the buffer cache has two
> > overlapping buffers with *different* daddrs.
> > 
> > (That teaches me to reply to emails while on vacation...)
> > 
> > xreap_agextent_binval is only called if the extent being freed is
> > completely owned by the data structure and is *not* crosslinked with
> > a different structure.  We've just rebuilt a data structure that was
> > corrupt in some manner, but the reaper doesn't know the details of
> > that corruption.  Therefore, the reaper should invalidate /all/
> > buffers that might span the extent being freed, no matter how they
> > ended up in the cache.  If a deceased data structure thought it was
> > the sole owner of a single fsblock of space starting at fsblock 16,
> > we ought to look for buffers (daddr 128, length 1), (128, 2) ...
> > (128, 8), (129, 1), (129, 2) ... (129, 7) ... (135, 1) and
> > invalidate all of them.
> 
> Ok, but we can't have random metadata buffers at individual daddr
> offsets - except for the AG headers, every metadata buffer is
> filesystem block aligned.  Hence in the above example, there is no
> possibility of having a metadata buffer at sectors 129-135 on a 4kB
> block size filesystem.
> 
> > Concrete example:
> > 
> > So let's pretend that we have an xfs with fs blocksize 4k and dir
> > blocksize 8k.  Let's suppose the directory's data fork maps fsblock
> > 16, resulting in a buffer (daddr 128, length 16).  Let us further
> > suppose the inobt then allocates fsblock 17, resulting in a buffer
> > (daddr 136, length 8).  These are crosslinked.
> 
> Right, that's an intersection of {128,16} and {136,8}.  The search
> region for buffer invalidation is {128, 8}, {128, 16} and {136, 8}.
> They are the only possible buffers that can be cached in that region
> for a 4kB block size filesystem, as there can be no metadata buffers
> starting at daddrs 129-135 or 137-143...

<nod> There's nothing in this flag that *prevents* someone from
spending a lot of time pounding on the buffer cache with a per-daddr
scan.  But note that the next patch only ever calls it with
fsblock-aligned daddr and length values.  So in the end, the only
xfs_buf queries will indeed be for {128, 8}, {128, 16} and {136, 8}
like you said.  See the next patch for the actual usage.

Sorry that I've been unclear about all this and compounding it with
not very good examples.  It's been a /very/ long time since I actually
touched this code (this rewrite has been waiting for merge since ...
2019?) and I'm basically coming in cold. :(
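To pin down the arithmetic in that example (a sketch: it assumes 4k fs
blocks, 512-byte sectors, and that fsblock 16 lands in AG 0 so the AG
offset doesn't shift the daddrs):

/* One 4k fsblock covers XFS_FSB_TO_BB(mp, 1) == 8 sectors. */
xfs_daddr_t	dir_daddr = XFS_FSB_TO_DADDR(mp, 16);	/* 16 * 8 == 128 */
int		dir_bblen = XFS_FSB_TO_BB(mp, 2);	/* 8k dir buf == 16 */
xfs_daddr_t	ino_daddr = XFS_FSB_TO_DADDR(mp, 17);	/* 17 * 8 == 136 */
int		ino_bblen = XFS_FSB_TO_BB(mp, 1);	/* 4k inobt buf == 8 */

/*
 * Cached buffers: {128, 16} and {136, 8}.  Since every query is
 * fsblock-aligned, the only lookups that can intersect them are
 * {128, 8}, {128, 16}, and {136, 8}.
 */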
Ultimately I made the invalidation scan code look like this (note that
the lookup flag has since been renamed to XBF_LIVESCAN):

/* Buffer cache scan context. */
struct xrep_bufscan {
	/* Disk address for the buffers we want to scan. */
	xfs_daddr_t		daddr;

	/* Maximum number of sectors to scan. */
	xfs_daddr_t		max_sectors;

	/* Each round, increment the search length by this number of sectors. */
	xfs_daddr_t		daddr_step;

	/* Internal scan state; initialize to zero. */
	xfs_daddr_t		__sector_count;
};

/*
 * Return an incore buffer from a sector scan, or NULL if there are no
 * buffers left to return.
 */
struct xfs_buf *
xrep_bufscan_advance(
	struct xfs_mount	*mp,
	struct xrep_bufscan	*scan)
{
	scan->__sector_count += scan->daddr_step;
	while (scan->__sector_count <= scan->max_sectors) {
		struct xfs_buf	*bp = NULL;
		int		error;

		error = xfs_buf_incore(mp->m_ddev_targp, scan->daddr,
				scan->__sector_count, XBF_LIVESCAN, &bp);
		if (!error)
			return bp;

		scan->__sector_count += scan->daddr_step;
	}

	return NULL;
}

/* Try to invalidate the incore buffers for an extent that we're freeing. */
STATIC void
xreap_agextent_binval(
	struct xreap_state	*rs,
	xfs_agblock_t		agbno,
	xfs_extlen_t		*aglenp)
{
	struct xfs_scrub	*sc = rs->sc;
	struct xfs_perag	*pag = sc->sa.pag;
	struct xfs_mount	*mp = sc->mp;
	xfs_agnumber_t		agno = sc->sa.pag->pag_agno;
	xfs_agblock_t		agbno_next = agbno + *aglenp;
	xfs_agblock_t		bno = agbno;

	/*
	 * Avoid invalidating AG headers and post-EOFS blocks because we
	 * never own those.
	 */
	if (!xfs_verify_agbno(pag, agbno) ||
	    !xfs_verify_agbno(pag, agbno_next - 1))
		return;

	/*
	 * If there are incore buffers for these blocks, invalidate them.
	 * We assume that the lack of any other known owners means that
	 * the buffer can be locked without risk of deadlocking.  The
	 * buffer cache cannot detect aliasing, so employ nested loops to
	 * scan for incore buffers of any plausible size.
	 */
	while (bno < agbno_next) {
		struct xrep_bufscan	scan = {
			.daddr		= XFS_AGB_TO_DADDR(mp, agno, bno),
			.max_sectors	= xrep_bufscan_max_sectors(mp,
							agbno_next - bno),
			.daddr_step	= XFS_FSB_TO_BB(mp, 1),
		};
		struct xfs_buf	*bp;

		while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) {
			xfs_trans_bjoin(sc->tp, bp);
			xfs_trans_binval(sc->tp, bp);
			rs->invalidated++;

			/*
			 * Stop invalidating if we've hit the limit; we
			 * should still have enough reservation left to
			 * free however far we've gotten.
			 */
			if (rs->invalidated > XREAP_MAX_BINVAL) {
				*aglenp -= agbno_next - bno;
				goto out;
			}
		}

		bno++;
	}

out:
	trace_xreap_agextent_binval(sc->sa.pag, agbno, *aglenp);
}
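The xrep_bufscan_max_sectors helper referenced above isn't shown here;
it just clamps the scan length to the largest metadata buffer we could
possibly have in the cache, which is a remote xattr value buffer.
Roughly (again from memory, so treat it as a sketch):

/*
 * Return the maximum sector count for a buffer cache scan, given a
 * quantity of fs blocks.
 */
xfs_daddr_t
xrep_bufscan_max_sectors(
	struct xfs_mount	*mp,
	xfs_extlen_t		fsblocks)
{
	int			max_fsbs;

	/* Remote xattr values are the largest buffers that we support. */
	max_fsbs = xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX);

	return XFS_FSB_TO_BB(mp, min_t(xfs_extlen_t, fsblocks, max_fsbs));
}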
I hope that makes it clearer?

--D

> > First, we discover the inobt is crosslinked with a directory and
> > decide to rebuild it.  We write out the new inobt and go to reap the
> > old blocks.  Since fsblock 17 is crosslinked, we simply remove the
> > OWN_INOBT rmap record and leave the buffer.
> > 
> > Aside: Long ago, I tried to make the reaping code invalidate buffers
> > when the space is crosslinked, but I couldn't figure out how to deal
> > with the situation where (say) the bmap btrees of two separate forks
> > think they own the same block.  The b_ops will be the same; the
> > buffer cache doesn't know about the owner field in the block header,
> > and there's no way to distinguish the blocks for a data fork bmbt
> > vs. an attr fork bmbt.
> > 
> > Next we discover that the directory is corrupt and decide to rebuild
> > that.  The directory is now the only owner, so it can actually free
> > the two fsb of space at startblock 16fsb.  Both buffers (128, 16)
> > and (136, 8) are still in the cache, so it needs to invalidate both.
> > 
> > Does all /that/ make sense?
> 
> Yes, it does, and I figured that is what you were trying to detect,
> it's just that the search pattern you described made no sense.
> 
> I still think it needs cleaning up - it's doing a massive amount of
> unnecessary work checking for things that cannot exist...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx