On Fri, Aug 23, 2019 at 11:53:47AM -0400, Brian Foster wrote:
> On Fri, Aug 23, 2019 at 08:23:49AM -0700, Darrick J. Wong wrote:
> > On Fri, Aug 23, 2019 at 11:02:21AM -0400, Brian Foster wrote:
> > > On Fri, Aug 16, 2019 at 07:06:51PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > 
> > > > The inode block mapping scrub function does more work for btree format
> > > > extent maps than is absolutely necessary -- first it will walk the bmbt
> > > > and check all the entries, and then it will load the incore tree and
> > > > check every entry in that tree.
> > > > 
> > > > Reduce the run time of the ondisk bmbt walk if the incore tree is loaded
> > > > by checking that the incore tree has an exact match for the bmbt extent.
> > > > Similarly, skip the incore tree walk if we have to load it from the
> > > > bmbt, since we just checked that.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/scrub/bmap.c |   40 +++++++++++++++++++++++++++++++++++++---
> > > >  1 file changed, 37 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
> > > > index 1bd29fdc2ab5..6170736fa94f 100644
> > > > --- a/fs/xfs/scrub/bmap.c
> > > > +++ b/fs/xfs/scrub/bmap.c
> > > > @@ -384,6 +384,7 @@ xchk_bmapbt_rec(
> > > >  	struct xfs_inode	*ip = bs->cur->bc_private.b.ip;
> > > >  	struct xfs_buf		*bp = NULL;
> > > >  	struct xfs_btree_block	*block;
> > > > +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, info->whichfork);
> > > >  	uint64_t		owner;
> > > >  	int			i;
> > > >  
> > > > @@ -402,8 +403,30 @@ xchk_bmapbt_rec(
> > > >  		}
> > > >  	}
> > > >  
> > > > -	/* Set up the in-core record and scrub it. */
> > > > +	/*
> > > > +	 * If the incore bmap cache is already loaded, check that it contains
> > > > +	 * an extent that matches this one exactly.  We validate those cached
> > > > +	 * bmaps later, so we don't need to check here.
> > > > +	 *
> > > > +	 * If the cache is /not/ loaded, we need to validate the bmbt records
> > > > +	 * now.
> > > > +	 */
> > > >  	xfs_bmbt_disk_get_all(&rec->bmbt, &irec);
> > > > +	if (ifp->if_flags & XFS_IFEXTENTS) {
> > > 
> > > ^ looks like whitespace damage right here.
> > 
> > Oops.  Fixed.
> > 
> > > > +		struct xfs_bmbt_irec	iext_irec;
> > > > +		struct xfs_iext_cursor	icur;
> > > > +
> > > > +		if (!xfs_iext_lookup_extent(ip, ifp, irec.br_startoff, &icur,
> > > > +				&iext_irec) ||
> > > > +		    irec.br_startoff != iext_irec.br_startoff ||
> > > > +		    irec.br_startblock != iext_irec.br_startblock ||
> > > > +		    irec.br_blockcount != iext_irec.br_blockcount ||
> > > > +		    irec.br_state != iext_irec.br_state)
> > > > +			xchk_fblock_set_corrupt(bs->sc, info->whichfork,
> > > > +					irec.br_startoff);
> > > > +		return 0;
> > > > +	}
> > > > +
> > > 
> > > Ok, so right now the bmbt walk makes no consideration of in-core state.
> > > With this change, we correlate every on-disk record with an in-core
> > > counterpart (if cached) and skip the additional extent checks...
> > > 
> > > >  	return xchk_bmap_extent(ip, bs->cur, info, &irec);
> > > >  }
> > > >  
> > > > @@ -671,11 +694,22 @@ xchk_bmap(
> > > >  	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> > > >  		goto out;
> > > >  
> > > > -	/* Now try to scrub the in-memory extent list. */
> > > > +	/*
> > > > +	 * If the incore bmap cache isn't loaded, then this inode has a bmap
> > > > +	 * btree and we already walked it to check all of the mappings.  Load
> > > > +	 * the cache now and skip ahead to rmap checking (which requires the
> > > > +	 * bmap cache to be loaded).  We don't need to check twice.
> > > > +	 *
> > > > +	 * If the cache /is/ loaded, then we haven't checked any mappings, so
> > > > +	 * iterate the incore cache and check the mappings now, because the
> > > > +	 * bmbt iteration code skipped the checks, assuming that we'd do them
> > > > +	 * here.
> > > > +	 */
> > > >  	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> > > >  		error = xfs_iread_extents(sc->tp, ip, whichfork);
> > > >  		if (!xchk_fblock_process_error(sc, whichfork, 0, &error))
> > > >  			goto out;
> > > > +		goto out_check_rmap;
> > > 
> > > ... because we end up doing that here. Otherwise, the bmbt walk did the
> > > extent checks, so we can skip it here.
> > 
> > Yep.  On the stress test case (which is bmapbtd checking of mdrestore'd
> > sparse images of large filesystems), only doing the extent walk + check
> > once can cut down the runtime by ~30%.
> > 
> > > I think I follow, but I'm a little confused by the need for such split
> > > logic when we follow up with an unconditional read of the extent tree
> > > anyways. Maybe I'm missing something, but couldn't we just read the
> > > extent tree a little earlier and always do the extent checks in one
> > > place?
> > 
> > The original goal was that if the extent cache isn't loaded, we want to
> > check the bmbt records before we even bother to call xfs_iread_extents,
> > so that someone could find out from the trace data exactly where in the
> > bmbt the corruption was found.
> 
> That certainly makes sense. There's also the line of thought that we
> probably shouldn't read the tree if we know the bmbt is corrupted
> on-disk (though I don't think anything prevents that from happening
> elsewhere).

Right.  AFAICT iread_extents will at least catch obvious corruptions in
the btree blocks (which makes checking them during scrub a little
redundant), but I think the general pattern is that we try to read them,
and if that fails we just bounce the error out to userspace.

> From a broader scrub performance perspective, wouldn't the clean inode
> case dominate performance anyways, since we may have to scan through
> however many clean inodes before we find corruption?

Yes.  Well, let's hope so anyway. :)

> > Granted, since we're reducing the scrub code to the bare minimum needed
> > to decide if something's good or bad due to the primary interface being
> > a bit field... I could unconditionally load the extent map earlier,
> > unconditionally check the iext records, and then the bmbt walk only
> > needs to check that the tree shape is ok and that each bmbt record
> > corresponds to an iext record.
> 
> Yeah, I think logically it sort of makes sense to 1.) check the bmbt is
> safe enough to read, 2.) read the extent tree, and 3.) check the extents
> via the extent tree, but the btree scanning iteration mechanism and
> whatnot make what you describe above more practical. I think that's a
> reasonable approach and makes the current logic easier to read without
> having to rework anything major.

Hm.  But if we load the iext cache earlier then we ought to split the
functionality in xchk_bmapbt_rec, since it's called both from the bmbt
walk and from the iext walk.  The bmbt check function would /only/
verify that the bmbt record has a corresponding iext record, whereas the
iext check function would do all the sanity checking and cross
referencing that xchk_bmapbt_rec does now.

> > The other way to go would be to convert xchk_bmap_check_rmaps to use a
> > bmbt cursor if the iext isn't loaded, in which case we wouldn't need to
> > load the iext cache at all.  That would reduce the kernel slab
> > perturbations at a cost of extra code complexity.
> 
> Hmm, so that means we wouldn't read in the extent tree, but we'd have to
> look up each individual bmbt record from disk in xchk_bmap_check_rmap(),
> right? If so, it's not clear to me that's an overall win over the
> current implementation, particularly since some other random thing (i.e.
> xfs_reflink_inode_has_shared_extents() called via a couple of places in
> scrub) might trigger reading an extent tree anyways. Perhaps this is
> something better considered separately if there's suspected value to
> it..

It would be the usual tradeoff between performance and memory usage.
Meh, I'll just keep loading the iext cache, since in theory memory
reclaim will kill off the stuff we've scanned and let go.  <cough>

--D

> Brian
> 
> > Thoughts?
> > 
> > --D
> > 
> > > Brian
> > > 
> > > >  	}
> > > >  
> > > >  	/* Find the offset of the last extent in the mapping. */
> > > > @@ -689,7 +723,7 @@ xchk_bmap(
> > > >  	for_each_xfs_iext(ifp, &icur, &irec) {
> > > >  		if (xchk_should_terminate(sc, &error) ||
> > > >  		    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
> > > > -			break;
> > > > +			goto out;
> > > >  		if (isnullstartblock(irec.br_startblock))
> > > >  			continue;
> > > >  		if (irec.br_startoff >= endoff) {