On Mon, Apr 09, 2018 at 10:23:42AM +1000, Dave Chinner wrote: > On Sat, Apr 07, 2018 at 10:55:42AM -0700, Darrick J. Wong wrote: > > On Fri, Apr 06, 2018 at 10:52:51AM +1000, Dave Chinner wrote: > > > On Mon, Apr 02, 2018 at 12:57:18PM -0700, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > > > > > Add some helper functions for repair functions that will help us to > > > > allocate and initialize new metadata blocks for btrees that we're > > > > rebuilding. > > > > > > This is more than "helper functions" - my replay is almost 700 lines > > > long! > > > > > > i.e. This is a stack of extent invalidation, freeing and rmap/free > > > space rebuild code. Needs a lot more description and context than > > > "helper functions". > > > > I agree now; this patch has become overly long and incohesive. It could > > be broken into the following pieces: > > > > - modifying transaction allocations to take resblks > > or "ensuring sufficient free space to rebuild" > > > > - rolling transactions > > > > - allocation and reinit functions > > > > - fixing/putting on the freelist > > > > - dealing with leftover blocks after we rebuild a tree > > > > - finding btree roots > > > > - resetting superblock counters > > > > So I'll break this up into seven smaller pieces. > > Thanks! > > > > > > ..... > > > > @@ -574,7 +575,10 @@ xfs_scrub_setup_fs( > > > > struct xfs_scrub_context *sc, > > > > struct xfs_inode *ip) > > > > { > > > > - return xfs_scrub_trans_alloc(sc->sm, sc->mp, &sc->tp); > > > > + uint resblks; > > > > + > > > > + resblks = xfs_repair_calc_ag_resblks(sc); > > > > + return xfs_scrub_trans_alloc(sc->sm, sc->mp, resblks, &sc->tp); > > > > } > > > > > > What's the block reservation for? > > > > We rebuild a metadata structure by constructing a completely new btree > > with blocks we got from the free space and switching the root when we're > > done. This isn't treated any differently from any other btree shape > > change, so we need to reserve blocks in the transaction. > > Can you put this in a comment? Added to the comment above xfs_scrub_trans_alloc. > > > > + args.tp = sc->tp; > > > > + args.mp = sc->mp; > > > > + args.oinfo = *oinfo; > > > > + args.fsbno = XFS_AGB_TO_FSB(args.mp, sc->sa.agno, 0); > > > > > > So our target BNO is the start of the AG. > > > > > > > + args.minlen = 1; > > > > + args.maxlen = 1; > > > > + args.prod = 1; > > > > + args.type = XFS_ALLOCTYPE_NEAR_BNO; > > > > > > And we do a "near" search" for a single block. i.e. we are looking > > > for a single block near to the start of the AG. Basically, we are > > > looking for the extent at the left edge of the by-bno tree, which > > > may not be a single block. > > > > > > Would it be better to do a XFS_ALLOCTYPE_THIS_AG allocation here so > > > we look up the by-size btree for a single block extent? The left > > > edge of that tree will always be the smallest free extent closest to > > > the start of the AG, and so using THIS_AG will tend to fill > > > small freespace holes rather than tear up larger free spaces for > > > single block allocations..... > > > > Makes sense. Originally I was thinking that we try to put the rebuilt > > btree as close to the start of the AG as possible, but there's little > > point in doing that so long as regular btree block allocation doesn't > > bother. > > > > TBH I've been wondering on the side if it makes more sense for us to > > detect things like dm-{thinp,cache} which have large(ish) underlying > > blocks and try to consolidate metadata into a smaller number of those > > underlying blocks at the start (or the end) of the AG? It could be > > useful to pack the metadata into a smaller number of underlying blocks > > so that if one piece of metadata gets hot enough the rest will end up in > > the fast cache as well. > > Separate issue, probably can be done with an in-memory threshold > for metadata allocation (e.g. search for BNO < 5% from start of AG). > I've thought about doing somthing like that (reserved area for > metadata) for some time, but I've never really seen signficant > advantages to doing it... <nod> I suppose on a spinny disk it might cut down scrub/repair time, though maybe for pmem it'll prove advantageous to try to keep free space as unfragmented as possible to increase the likelihood of large page use. Oh well, that's a research question for another thread. :) > > > > +/* Put a block back on the AGFL. */ > > > > +int > > > > +xfs_repair_put_freelist( > > > > + struct xfs_scrub_context *sc, > > > > + xfs_agblock_t agbno) > > > > +{ > > > > + struct xfs_owner_info oinfo; > > > > + int error; > > > > + > > > > + /* > > > > + * Since we're "freeing" a lost block onto the AGFL, we have to > > > > + * create an rmap for the block prior to merging it or else other > > > > + * parts will break. > > > > + */ > > > > + xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG); > > > > + error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno, agbno, 1, > > > > + &oinfo); > > > > + if (error) > > > > + return error; > > > > + > > > > + /* Put the block on the AGFL. */ > > > > + error = xfs_alloc_put_freelist(sc->tp, sc->sa.agf_bp, sc->sa.agfl_bp, > > > > + agbno, 0); > > > > > > At what point do we check there's actually room in the AGFL for this > > > block? > > > > > > > + if (error) > > > > + return error; > > > > + xfs_extent_busy_insert(sc->tp, sc->sa.agno, agbno, 1, > > > > + XFS_EXTENT_BUSY_SKIP_DISCARD); > > > > + > > > > + /* Make sure the AGFL doesn't overfill. */ > > > > + return xfs_repair_fix_freelist(sc, true); > > > > > > i.e. shouldn't this be done first? > > > > > > > > ??? That was supposed to be "Yes". > > > > > + else if (resv == XFS_AG_RESV_AGFL) > > > > + error = xfs_repair_put_freelist(sc, agbno); > > > > + else > > > > + error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv); > > > > + if (error) > > > > + break; > > > > + > > > > + if (sc->ip) > > > > + error = xfs_trans_roll_inode(&sc->tp, sc->ip); > > > > + else > > > > + error = xfs_repair_roll_ag_trans(sc); > > > > > > why don't we break on error here? > > > > We do, but it's up in the for loop. > > I missed that completely. I'd much prefer we don't hide "break on > error" conditions inside the for loop machinery because it is so > easy to miss... I refactored the loop body into a separate helper, which decreased the indentation and made the loop control clearer. > > > > > + > > > > + trace_xfs_repair_collect_btree_extent(sc->mp, > > > > + XFS_FSB_TO_AGNO(sc->mp, fsbno), > > > > + XFS_FSB_TO_AGBNO(sc->mp, fsbno), len); > > > > + > > > > + rae = kmem_alloc(sizeof(struct xfs_repair_extent), > > > > + KM_MAYFAIL | KM_NOFS); > > > > > > Why KM_NOFS? > > > > The caller holds a transaction and (potentially) has locked an entire > > AG, so we want to avoid recursing into the filesystem to free up memory. > > If we're in transaction context, we are already under a KM_NOFS > allocation contexts. SO it shouldn't be needed here - if we are > called outside transaction context, then we need a comment > explaining it (as opposed to using KM_NOFS to shut up lockdep). Ok, I'll get rid of the KM_NOFS in this and all the other patches. (I admit I probably /have/ been stuffing in KM_NOFS to shut up lockdep...) > > > > + for_each_xfs_repair_extent_safe(rae, n, exlist) { > > > > + agno = XFS_FSB_TO_AGNO(sc->mp, rae->fsbno); > > > > + agbno = XFS_FSB_TO_AGBNO(sc->mp, rae->fsbno); > > > > + for (i = 0; i < rae->len; i++) { > > > > + bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno, > > > > + agbno + i, 0); > > > > + xfs_trans_binval(sc->tp, bp); > > > > + } > > > > + } > > > > > > What if they are data blocks we are dumping? xfs_trans_binval is > > > fine for metadata blocks, but creating a stale metadata buffer and > > > logging a buffer cancel for user data blocks doesn't make any sense > > > to me.... > > > > exlist is only supposed to contain references to metadata blocks that we > > collected from the rmapbt. The current iteration of the repair code > > shouldn't ever dump file data blocks. > > That needs explanation, then. I had no idea that exlist was only for > metadata blocks... Ok, comment changed to: /* * Invalidate buffers for per-AG btree blocks we're dumping. We assume * that exlist points only to metadata blocks. */ > > > > +/* Remove all the blocks in sublist from exlist. */ > > > > +#define LEFT_CONTIG (1 << 0) > > > > +#define RIGHT_CONTIG (1 << 1) > > > > > > Namespace, please. > > > > > > > +int > > > > +xfs_repair_subtract_extents( > > > > + struct xfs_scrub_context *sc, > > > > + struct xfs_repair_extent_list *exlist, > > > > + struct xfs_repair_extent_list *sublist) > > > > > > What is this function supposed to do? I can't really review the code > > > (looks complex) because I don't know what it's function is supposed > > > to be. > > > > The intent is that we iterate the rmapbt for all of its records for a > > given owner to generate exlist. Then we iterate all blocks of metadata > > structures that we are not rebuilding but have the same rmapbt owner to > > generate sublist. This routine subtracts all the blocks mentioned in > > sublist from all the extents linked in exlist, which leaves exlist as > > the list of all blocks owned by the old version of btree that we're > > rebuilding. The list is fed into _reap_btree_extents to free the old > > btree blocks (or merely remove the rmap if the block is crosslinked). > > Comment, please :) Added. > > > > + > > > > + if (delta_icount) { > > > > + error = xfs_mod_icount(mp, delta_icount); > > > > + if (error) > > > > + goto out; > > > > + } > > > > + if (delta_ifree) { > > > > + error = xfs_mod_ifree(mp, delta_ifree); > > > > + if (error) > > > > + goto out; > > > > + } > > > > + if (delta_fdblocks) { > > > > + error = xfs_mod_fdblocks(mp, delta_fdblocks, false); > > > > + if (error) > > > > + goto out; > > > > + } > > > > + > > > > +out: > > > > + xfs_trans_cancel(tp); > > > > + return error; > > > > > > Ummmm - why do we do all this superblock modification work in a > > > transaction context, and then just cancel it? > > > > It's an empty transaction, so nothing gets dirty and nothing needs > > committing. TBH I don't even think this is necessary, since we only use > > it for reading the agi/agf, and for that we can certainly _buf_relse > > instead of letting the _trans_cancel do it. > > Yeah, it'd be better to get rid of the transaction context if we > can. It should also explain how it avoids races with ongoing > superblock counter updates... <nod> AFAICT it's a simple matter of taking m_sb_lock to update the m_sb counters, after which we unlock and do the percpu counters. > > > > + inobt_sz = xfs_iallocbt_calc_size(mp, icount / > > > > + XFS_INODES_PER_CHUNK); > > > > + if (xfs_sb_version_hasfinobt(&mp->m_sb)) > > > > + inobt_sz *= 2; > > > > + if (xfs_sb_version_hasreflink(&mp->m_sb)) { > > > > + rmapbt_sz = xfs_rmapbt_calc_size(mp, aglen); > > > > + refcbt_sz = xfs_refcountbt_calc_size(mp, usedlen); > > > > + } else { > > > > + rmapbt_sz = xfs_rmapbt_calc_size(mp, usedlen); > > > > + refcbt_sz = 0; > > > > + } > > > > + if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) > > > > + rmapbt_sz = 0; > > > > + > > > > + trace_xfs_repair_calc_ag_resblks_btsize(mp, sm->sm_agno, bnobt_sz, > > > > + inobt_sz, rmapbt_sz, refcbt_sz); > > > > + > > > > + return max(max(bnobt_sz, inobt_sz), max(rmapbt_sz, refcbt_sz)); > > > > > > Why are we only returning the resblks for a single tree rebuild? > > > Shouldn't it return the total blocks require to rebuild all all the > > > trees? > > > > A repair call only ever rebuilds one btree in one AG, so I don't see why > > we'd need to reserve space to rebuild all the btrees in the AG (or the > > same btree in all AGs)... though it's possible that I don't understand > > the question? > > It's not clear that we are only doing a reservation for a single > tree in an AG. needs a comment to explain the context the > reservation is being made for. /* * Figure out how many blocks to reserve for an AG repair. We calculate * the worst case estimate for the number of blocks we'd need to rebuild * one of any type of per-AG btree. */ --D > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html