On Thu, May 31, 2018 at 08:42:06AM +0300, Amir Goldstein wrote: > On Wed, May 30, 2018 at 10:31 PM, Darrick J. Wong > <darrick.wong@xxxxxxxxxx> wrote: > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > Rebuild the reverse mapping btree from all primary metadata. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > --- > > fs/xfs/Makefile | 1 > > fs/xfs/scrub/common.c | 6 > > fs/xfs/scrub/repair.c | 119 +++++++ > > fs/xfs/scrub/repair.h | 27 + > > fs/xfs/scrub/rmap.c | 6 > > fs/xfs/scrub/rmap_repair.c | 796 ++++++++++++++++++++++++++++++++++++++++++++ > > fs/xfs/scrub/scrub.c | 18 + > > fs/xfs/scrub/scrub.h | 2 > > fs/xfs/xfs_mount.h | 1 > > fs/xfs/xfs_super.c | 27 + > > fs/xfs/xfs_trans.c | 7 > > 11 files changed, 1004 insertions(+), 6 deletions(-) > > create mode 100644 fs/xfs/scrub/rmap_repair.c > > > > > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile > > index 7c442f83b179..b9bbac3d5075 100644 > > --- a/fs/xfs/Makefile > > +++ b/fs/xfs/Makefile > > @@ -178,6 +178,7 @@ xfs-y += $(addprefix scrub/, \ > > alloc_repair.o \ > > ialloc_repair.o \ > > repair.o \ > > + rmap_repair.o \ > > ) > > endif > > endif > > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c > > index 89938b328954..f92994716522 100644 > > --- a/fs/xfs/scrub/common.c > > +++ b/fs/xfs/scrub/common.c > > @@ -603,9 +603,13 @@ xfs_scrub_trans_alloc( > > struct xfs_scrub_context *sc, > > uint resblks) > > { > > + uint flags = 0; > > + > > + if (sc->fs_frozen) > > + flags |= XFS_TRANS_NO_WRITECOUNT; > > if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) > > return xfs_trans_alloc(sc->mp, &M_RES(sc->mp)->tr_itruncate, > > - resblks, 0, 0, &sc->tp); > > + resblks, 0, flags, &sc->tp); > > > > return xfs_trans_alloc_empty(sc->mp, &sc->tp); > > } > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c > > index 45a91841c0ac..4b5d599d53b9 100644 > > --- a/fs/xfs/scrub/repair.c > > +++ b/fs/xfs/scrub/repair.c > > @@ -43,6 +43,8 @@ > > #include "xfs_ag_resv.h" > > #include "xfs_trans_space.h" > > #include "xfs_quota.h" > > +#include "xfs_bmap.h" > > +#include "xfs_bmap_util.h" > > #include "scrub/xfs_scrub.h" > > #include "scrub/scrub.h" > > #include "scrub/common.h" > > @@ -1146,3 +1148,120 @@ xfs_repair_mod_ino_counts( > > (int64_t)freecount - old_freecount); > > } > > } > > + > > +/* > > + * Freeze the FS against all other activity so that we can avoid ABBA > > + * deadlocks while taking locks in unusual orders so that we can rebuild > > + * metadata structures such as the rmapbt. > > + */ > > +int > > +xfs_repair_fs_freeze( > > + struct xfs_scrub_context *sc) > > +{ > > + int error; > > + > > + error = freeze_super(sc->mp->m_super); > > + if (error) > > + return error; > > + sc->fs_frozen = true; > > + return 0; > > +} > > + > > +/* Unfreeze the FS. */ > > +int > > +xfs_repair_fs_thaw( > > + struct xfs_scrub_context *sc) > > +{ > > + struct inode *inode, *o; > > + int error; > > + > > + sc->fs_frozen = false; > > + error = thaw_super(sc->mp->m_super); > > + > > + inode = sc->frozen_inode_list; > > + while (inode) { > > + o = inode->i_private; > > + inode->i_private = NULL; > > + iput(inode); > > + inode = o; > > + } > > + > > + return error; > > +} > > + > > > I think that new mechanism is worth a mention in the commit message, > if not a patch of its own with cc to fsdevel. > In a discussion on said patch I would ask: how does xfs_repair_fs_freeze() > work in collaboration with user initiated fsfreeze? > Is there a situation where LVM can be fooled to think that XFS is really > frozen, but it is actually "repair frozen"? and metadata can change while > taking a snapshot? Notice how xfs added a ->freeze_super handler to the superblock operations that prohibits userspace from initiating a freeze while any repair operations are running? If userspace (lvm, etc.) try to initiate a freeze while repair is running, the freeze attempt is kicked back to userspace with -EBUSY. Similarly, a new xfs ->thaw_super handler prevents userspace from unfreezing while repair is running. Granted, the current patch doesn't quite work right either; I've replaced m_scrubbers with a mutex that repair holds for the duration of the repair freeze; this way regular freeze/thaw requests will block until the repair is finished. > This is why I suggested to add a VFS freeze level, e.g. > SB_FREEZE_FS_MAINTAINANCE so that you don't publish XFS state > as SB_FREEZE_COMPLETE while you are modifying metadata on disk. > It might be sufficient to get XFS to state SB_FREEZE_COMPLETE and > then up only to SB_FREEZE_FS in xfs_repair_fs_freeze() without > adding any new states. I tried that, and it didn't work. We actually /do/ want to be at SB_FREEZE_COMPLETE so that repair is the /only/ thread that can change any filesystem state. Under normal circumstances, XFS transaction allocation will block until the SB_FREEZE_COMPLETE condition clears. This stops any background space reclamation from happening, at least if it requires a transaction. Online repair of course grants itself the ability to run transactions even during FREEZE_COMPLETE. Will the following comment (to be embedded in the repair code) explain this all sufficiently? /* * Freezing the Filesystem for a Repair * ==================================== * * While most repair activity can occur while the filesystem is live, * there are certain scenarios where we cannot tolerate concurrent * metadata updates. We therefore must freeze the filesystem against * all other changes. * * The typical scenarios envisioned for repair freezes are (a) to avoid * ABBA deadlocks when need to take locks in an unusual order; or (b) to * update global filesystem state. For example, reconstruction of a * damaged reverse mapping btree requires us to hold the AG header locks * while scanning inodes, which goes against the usual inode -> AG * header locking order. * * A note about inode reclaim: when we freeze the filesystem, users * can't modify things and periodic background reclaim of speculative * preallocations and copy-on-write staging extents is stopped. * However, the repair thread must be careful about evicting an inode * from memory -- if the eviction would require a transaction, we must * defer the iput until after the repair freeze. The reasons for this * are twofold: first, repair already has a transaction and xfs can't * nest transactions; and second, we froze the fs to prevent * modifications that repair doesn't control directly. * * Userspace is prevented from freezing or thawing the filesystem during * a repair freeze by the ->freeze_super and ->thaw_super superblock * operations, which block any changes to the freeze state while a * repair freeze is running through the use of the m_repair_freeze * mutex. It only makes sense to run one repair freeze at a time, so * the mutex is fine. * * Repair freezes cannot be initiated during a regular freeze because * freeze_super does not allow nested freeze. Repair activity that does * not require a repair freeze is also prevented from running during a * regular freeze because transaction allocation blocks on the regular * freeze. We assume that the only other users of * XFS_TRANS_NO_WRITECOUNT transactions either aren't modifying space * metadata in a way that would affect repair, or that we can inhibit * any of the ones that do. */ --D > > Thanks, > Amir. > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html