On Tue, May 15, 2018 at 03:34:04PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > Now that we've plumbed in the ability to construct a list of dead btree > blocks following a repair, add more helpers to dispose of them. This is > done by examining the rmapbt -- if the btree was the only owner we can > free the block, otherwise it's crosslinked and we can only remove the > rmapbt record. > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > --- > fs/xfs/scrub/repair.c | 200 +++++++++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/scrub/repair.h | 3 + > 2 files changed, 202 insertions(+), 1 deletion(-) > > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c > index 8e8ecddd7537..d820e01d1146 100644 > --- a/fs/xfs/scrub/repair.c > +++ b/fs/xfs/scrub/repair.c > @@ -362,6 +362,173 @@ xfs_repair_init_btblock( > return 0; > } > > +/* Ensure the freelist is the correct size. */ > +int > +xfs_repair_fix_freelist( > + struct xfs_scrub_context *sc, > + bool can_shrink) > +{ > + struct xfs_alloc_arg args = {0}; > + int error; > + > + args.mp = sc->mp; > + args.tp = sc->tp; > + args.agno = sc->sa.agno; > + args.alignment = 1; > + args.pag = xfs_perag_get(args.mp, sc->sa.agno); > + > + error = xfs_alloc_fix_freelist(&args, > + can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK); > + xfs_perag_put(args.pag); with all these pag lookups, I'm starting to wonder if you should just add a lookup and store the pag in the scrub context? That'd save a lot of lookups - the pag structures never go away and i never really intended them to be used like this in single, very short use contexts. Grab once per operation, hold the reference across the entire operation context, then release it.... > + return error; > +} > + > +/* > + * Put a block back on the AGFL. > + */ > +STATIC int > +xfs_repair_put_freelist( > + struct xfs_scrub_context *sc, > + xfs_agblock_t agbno) > +{ > + struct xfs_owner_info oinfo; > + struct xfs_perag *pag; > + int error; > + > + /* Make sure there's space on the freelist. */ > + error = xfs_repair_fix_freelist(sc, true); > + if (error) > + return error; > + pag = xfs_perag_get(sc->mp, sc->sa.agno); Because this is how it quickly gets it gets to silly numbers of lookups. That's two now in this function. > + if (pag->pagf_flcount == 0) { > + xfs_perag_put(pag); > + return -EFSCORRUPTED; Why is having an empty freelist a problem here? It's an AG thatis completely out of space, but it isn't corruption? And I don't see why an empty freelist prevents us from adding a backs back onto the AGFL? > + } > + xfs_perag_put(pag); > + > + /* > + * Since we're "freeing" a lost block onto the AGFL, we have to > + * create an rmap for the block prior to merging it or else other > + * parts will break. > + */ > + xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG); > + error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno, agbno, 1, > + &oinfo); > + if (error) > + return error; > + > + /* Put the block on the AGFL. */ > + error = xfs_alloc_put_freelist(sc->tp, sc->sa.agf_bp, sc->sa.agfl_bp, > + agbno, 0); There's another pag lookup in here. > + if (error) > + return error; > + xfs_extent_busy_insert(sc->tp, sc->sa.agno, agbno, 1, > + XFS_EXTENT_BUSY_SKIP_DISCARD); And another in here, so 4 perag lookups for the same structure in one simple operation. The code here in the function is fine, but I really think we need to rethink how we use the perag in our allocation code... > + > + return 0; > +} > + > +/* Dispose of a single metadata block. */ > +STATIC int > +xfs_repair_dispose_btree_block( > + struct xfs_scrub_context *sc, > + xfs_fsblock_t fsbno, > + struct xfs_owner_info *oinfo, > + enum xfs_ag_resv_type resv) > +{ > + struct xfs_btree_cur *cur; > + struct xfs_buf *agf_bp = NULL; > + xfs_agnumber_t agno; > + xfs_agblock_t agbno; > + bool has_other_rmap; > + int error; > + > + agno = XFS_FSB_TO_AGNO(sc->mp, fsbno); > + agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno); > + > + if (sc->ip) { > + /* Repairing per-inode metadata, read in the AGF. */ This would be better with a comment above saying: /* * If we are repairing per-inode metadata, we need to read * in the AGF buffer. Otherwise we can re-use the existing * AGF buffer we set up for repairing the per-AG btrees. */ > + error = xfs_alloc_read_agf(sc->mp, sc->tp, agno, 0, &agf_bp); > + if (error) > + return error; > + if (!agf_bp) > + return -ENOMEM; > + } else { > + /* Repairing per-AG btree, reuse existing AGF buffer. */ > + agf_bp = sc->sa.agf_bp; > + } > + cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, agno); > + > + /* Can we find any other rmappings? */ > + error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap); > + if (error) > + goto out_cur; > + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR); > + > + /* > + * If there are other rmappings, this block is cross linked and must > + * not be freed. Remove the reverse mapping and move on. Otherwise, Why do we just remove the reverse mapping if the block cannot be freed? I have my suspicions that this is removing cross-links one by one until there's only one reference left to the extent, but then I ask "how do we know which one is the correct mapping"? i.e. this comment raised more questions about the algorithm for dealing with cross-linked blocks - which doesn't appear to be explained anywhere - than it answers.... > + * we were the only owner of the block, so free the extent, which will > + * also remove the rmap. > + */ > + if (has_other_rmap) > + error = xfs_rmap_free(sc->tp, agf_bp, agno, agbno, 1, oinfo); > + else if (resv == XFS_AG_RESV_AGFL) > + error = xfs_repair_put_freelist(sc, agbno); > + else > + error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv); > + if (agf_bp != sc->sa.agf_bp) > + xfs_trans_brelse(sc->tp, agf_bp); > + if (error) > + return error; > + > + if (sc->ip) > + return xfs_trans_roll_inode(&sc->tp, sc->ip); > + return xfs_repair_roll_ag_trans(sc); > + > +out_cur: > + xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); > + if (agf_bp != sc->sa.agf_bp) > + xfs_trans_brelse(sc->tp, agf_bp); > + return error; > +} > + > +/* > + * Dispose of a given metadata extent. > + * > + * If the rmapbt says the extent has multiple owners, we simply remove the > + * rmap associated with this owner hoping that we'll eventually disentangle > + * the crosslinked metadata. Otherwise, there's one owner, so call the > + * regular free code to remove the rmap and free the extent. Any existing > + * buffers for the blocks in the extent must have been _binval'd previously. Ok, so there's a little more detail about cross-linked files. Seems my suspicions of "remove and hope" were close to the mark :P Perhaps we need a document that describes the various algorithms we use for resolving these problems, so they can be discussed and improved without having to troll through the code to understand? > + */ > +STATIC int > +xfs_repair_dispose_btree_extent( > + struct xfs_scrub_context *sc, > + xfs_fsblock_t fsbno, > + xfs_extlen_t len, > + struct xfs_owner_info *oinfo, > + enum xfs_ag_resv_type resv) > +{ > + struct xfs_mount *mp = sc->mp; > + int error = 0; > + > + ASSERT(xfs_sb_version_hasrmapbt(&mp->m_sb)); > + ASSERT(sc->ip != NULL || XFS_FSB_TO_AGNO(mp, fsbno) == sc->sa.agno); > + > + trace_xfs_repair_dispose_btree_extent(mp, XFS_FSB_TO_AGNO(mp, fsbno), > + XFS_FSB_TO_AGBNO(mp, fsbno), len); > + > + for (; len > 0; len--, fsbno++) { > + error = xfs_repair_dispose_btree_block(sc, fsbno, oinfo, resv); > + if (error) > + return error; So why do we do this one block at a time, rather than freeing it as an entire extent in one go? And if there is only a single caller, why not just open code the loop in the caller? > +/* > + * Invalidate buffers for per-AG btree blocks we're dumping. We assume that > + * exlist points only to metadata blocks. > + */ > +int > +xfs_repair_invalidate_blocks( > + struct xfs_scrub_context *sc, > + struct xfs_repair_extent_list *exlist) > +{ > + struct xfs_repair_extent *rex; > + struct xfs_repair_extent *n; > + struct xfs_buf *bp; > + xfs_agnumber_t agno; > + xfs_agblock_t agbno; > + xfs_agblock_t i; > + > + for_each_xfs_repair_extent_safe(rex, n, exlist) { > + agno = XFS_FSB_TO_AGNO(sc->mp, rex->fsbno); > + agbno = XFS_FSB_TO_AGBNO(sc->mp, rex->fsbno); > + for (i = 0; i < rex->len; i++) { > + bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno, > + agbno + i, 0); > + xfs_trans_binval(sc->tp, bp); > + } Again, this is doing things by single blocks. We do have multi-block metadata (inodes, directory blocks, remote attrs) that, if it is already in memory, needs to be treated as multi-block extents. If we don't do that, we'll cause aliasing problems in the buffer cache (see _xfs_buf_obj_cmp()) and it's all downhill from there. That's why I get worried when I see assumptions that we can process contiguous metadata ranges in single block buffers.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html