On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
>
> Rebuild the reverse mapping btree from all primary metadata.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>

....

> +static inline int xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	/* We don't support rmap repair, but we can still do a scan. */
> +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> +}

This comment seems at odds with the commit message....

....

> +/*
> + * Reverse Mapping Btree Repair
> + * ============================
> + *
> + * This is the most involved of all the AG space btree rebuilds. Everywhere
> + * else in XFS we lock inodes and then AG data structures, but generating the
> + * list of rmap records requires that we be able to scan both block mapping
> + * btrees of every inode in the filesystem to see if it owns any extents in
> + * this AG. We can't tolerate any inode updates while we do this, so we
> + * freeze the filesystem to lock everyone else out, and grant ourselves
> + * special privileges to run transactions with regular background reclamation
> + * turned off.

Hmmm. This implies we are going to scan the entire filesystem for
every AG we need to rebuild the rmap tree in. That seems like an
awful lot of work if there's more than one rmap btree that needs
rebuilding.

> + * We also have to be very careful not to allow inode reclaim to start a
> + * transaction because all transactions (other than our own) will block.

What happens when we run out of memory? Inode reclaim will need to
run at that point, right?

> + * So basically we scan all primary per-AG metadata and all block maps of all
> + * inodes to generate a huge list of reverse map records. Next we look for
> + * gaps in the rmap records to calculate all the unclaimed free space (1).
> + * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
> + * the space used by those btrees from (1), and also subtract the free space
> + * listed in the bnobt from (1). What's left are the gaps in assigned space
> + * that the new rmapbt knows about but the existing bnobt doesn't; these are
> + * the blocks from the old rmapbt and they can be freed.

This looks like a lot of repeated work. We've already scanned a
bunch of these trees to repair them, then thrown away the scan
results. Now we do another scan of what we've rebuilt.....

... hold on.

Chicken and egg.

We verify and rebuild all the other trees from the rmap information
- how do we determine that the rmapbt needs to be rebuilt and that
the metadata it's being rebuilt from is valid?

Given that we've effectively got to shut down access to the
filesystem for the entire rmap rebuild while we do an entire
filesystem scan, why would we do this online? It's going to be
faster to do this rebuild offline (because of all the prefetching,
rebuilding all AG trees from the state gathered in the full
filesystem passes, etc) and we don't have to hack around potential
transaction and memory reclaim deadlock situations, either?

So why do rmap rebuilds online at all?
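To put a rough number on what a full filesystem scan per damaged AG
means (figures invented purely for illustration): a filesystem with a
few hundred AGs and ~3x10^8 allocated inodes that has, say, 10 rmapbts
fail scrub will do 10 separate freeze/scan cycles:

	10 rebuilds * 3x10^8 inodes = 3x10^9 iget/ilock/bmap-scan/iput
	cycles

all of it performed while every other user of the filesystem is
frozen out.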
> + */
> +
> +/* Set us up to repair reverse mapping btrees. */
> +int
> +xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	int				error;
> +
> +	/*
> +	 * Freeze out anything that can lock an inode. We reconstruct
> +	 * the rmapbt by reading inode bmaps with the AGF held, which is
> +	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
> +	 * inodes.
> +	 */
> +	error = xfs_scrub_fs_freeze(sc);
> +	if (error)
> +		return error;
> +
> +	/* Check the AG number and set up the scrub context. */
> +	error = xfs_scrub_setup_fs(sc, ip);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Lock all the AG header buffers so that we can read all the
> +	 * per-AG metadata too.
> +	 */
> +	error = xfs_repair_grab_all_ag_headers(sc);
> +	if (error)
> +		return error;

So if we have thousands of AGs (think PB scale filesystems) then
we're going to hold many thousands of locked buffers here? Just so
we can rebuild the rmapbt in one AG?

What does holding these buffers locked protect us against that an
active freeze doesn't?

> +xfs_repair_rmapbt_new_rmap(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agblock_t			startblock,
> +	xfs_extlen_t			blockcount,
> +	uint64_t			owner,
> +	uint64_t			offset,
> +	unsigned int			flags)
> +{
> +	struct xfs_repair_rmapbt_extent	*rre;
> +	int				error = 0;
> +
> +	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
> +			startblock, blockcount, owner, offset, flags);
> +
> +	if (xfs_scrub_should_terminate(rr->sc, &error))
> +		return error;
> +
> +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> +	if (!rre)
> +		return -ENOMEM;

This seems like a likely thing to happen given the "no reclaim"
state of the filesystem and the memory demand a rmapbt rebuild can
have. If we've got GBs of rmap info in the AG that needs to be
rebuilt, how much RAM are we going to need to index it all as we
scan the filesystem?
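Back of the envelope, just to show the scale I'm worried about: the
on-disk rmapbt record is 24 bytes, so "GBs of rmap info" is on the
order of 10^8 records for that one AG. If I'm reading the structures
right, each of those turns into a struct xfs_repair_rmapbt_extent on
a list - roughly a 32 byte xfs_rmap_irec plus a 16 byte list_head,
which kmem_alloc will round up to a 64 byte slab object:

	10^8 records * 64 bytes ~= 6.4GB

of kernel memory that can't be reclaimed for as long as the scan
runs, on top of all the cached inodes we're refusing to reclaim.
(Sizes are my estimate from reading the structures, not measured.)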
> +xfs_repair_rmapbt_scan_ifork(
> +	struct xfs_repair_rmapbt	*rr,
> +	struct xfs_inode		*ip,
> +	int				whichfork)
> +{
> +	struct xfs_bmbt_irec		rec;
> +	struct xfs_iext_cursor		icur;
> +	struct xfs_mount		*mp = rr->sc->mp;
> +	struct xfs_btree_cur		*cur = NULL;
> +	struct xfs_ifork		*ifp;
> +	unsigned int			rflags;
> +	int				fmt;
> +	int				error = 0;
> +
> +	/* Do we even have data mapping extents? */
> +	fmt = XFS_IFORK_FORMAT(ip, whichfork);
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +	switch (fmt) {
> +	case XFS_DINODE_FMT_BTREE:
> +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> +			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
> +			if (error)
> +				return error;
> +		}

Ok, so we need inodes locked to do this....

....

> +/* Iterate all the inodes in an AG group. */
> +STATIC int
> +xfs_repair_rmapbt_scan_inobt(
> +	struct xfs_btree_cur		*cur,
> +	union xfs_btree_rec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_inobt_rec_incore	irec;
> +	struct xfs_repair_rmapbt	*rr = priv;
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	struct xfs_inode		*ip = NULL;
> +	xfs_ino_t			ino;
> +	xfs_agino_t			agino;
> +	int				chunkidx;
> +	int				lock_mode = 0;
> +	int				error = 0;
> +
> +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> +
> +	for (chunkidx = 0, agino = irec.ir_startino;
> +	     chunkidx < XFS_INODES_PER_CHUNK;
> +	     chunkidx++, agino++) {
> +		bool	inuse;
> +
> +		/* Skip if this inode is free */
> +		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
> +			continue;
> +		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
> +
> +		/* Back off and try again if an inode is being reclaimed */
> +		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
> +				&inuse);
> +		if (error == -EAGAIN)
> +			return -EDEADLOCK;

And we can get inode access errors here.....

FWIW, how is the inode being reclaimed if the filesystem is frozen?

> +
> +		/*
> +		 * Grab inode for scanning. We cannot use DONTCACHE here
> +		 * because we already have a transaction so the iput must not
> +		 * trigger inode reclaim (which might allocate a transaction
> +		 * to clean up posteof blocks).
> +		 */
> +		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);

So if there are enough inodes in the AG, we'll run out of memory
here because we aren't reclaiming inodes from the cache but instead
putting them all on the deferred iput list?

> +		if (error)
> +			return error;
> +		trace_xfs_scrub_iget(ip, __this_address);
> +
> +		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
> +		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
> +			lock_mode = XFS_ILOCK_EXCL;
> +		else
> +			lock_mode = XFS_ILOCK_SHARED;
> +		if (!xfs_ilock_nowait(ip, lock_mode)) {
> +			error = -EBUSY;
> +			goto out_rele;
> +		}

And in what situation do we get inodes stuck with the ilock held on
frozen filesystems?

....

> +out_unlock:
> +	xfs_iunlock(ip, lock_mode);
> +out_rele:
> +	iput(VFS_I(ip));
> +	return error;

Calling iput in the error path is a bug - it will trigger all the
paths you're trying to avoid by using the deferred iput list.
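What I'd expect here is something like the following - completely
untested, and xfs_scrub_iput_defer() is just a name I've made up to
stand in for whatever the deferred iput mechanism in this series is
actually called:

out_unlock:
	xfs_iunlock(ip, lock_mode);
out_rele:
	/*
	 * Defer the final iput so that it can't recurse into inode
	 * reclaim or start a transaction while we hold the freeze.
	 */
	xfs_scrub_iput_defer(rr->sc, ip);
	return error;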
....

> +/* Collect rmaps for all block mappings for every inode in this AG. */
> +STATIC int
> +xfs_repair_rmapbt_generate_aginode_rmaps(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agnumber_t			agno)
> +{
> +	struct xfs_scrub_context	*sc = rr->sc;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_buf			*agi_bp;
> +	int				error;
> +
> +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> +	if (error)
> +		return error;
> +	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
> +	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);

So if we get a locked or reclaiming inode anywhere in the filesystem
we see EDEADLOCK/EBUSY here without having scanned all the inodes in
the AG, right?

> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_trans_brelse(sc->tp, agi_bp);
> +	return error;
> +}
> +
> +/*
> + * Generate all the reverse-mappings for this AG, a list of the old rmapbt
> + * blocks, and the new btreeblks count. Figure out if we have enough free
> + * space to reconstruct the inode btrees. The caller must clean up the lists
> + * if anything goes wrong.
> + */
> +STATIC int
> +xfs_repair_rmapbt_find_rmaps(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*rmap_records,
> +	xfs_agblock_t			*new_btreeblks)
> +{
> +	struct xfs_repair_rmapbt	rr;
> +	xfs_agnumber_t			agno;
> +	int				error;
> +
> +	rr.rmaplist = rmap_records;
> +	rr.sc = sc;
> +	rr.nr_records = 0;
> +
> +	/* Generate rmaps for AG space metadata */
> +	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
> +	if (error)
> +		return error;
> +
> +	/* Iterate all AGs for inodes rmaps. */
> +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
> +		if (error)
> +			return error;

And that means we abort here....

> +/* Repair the rmap btree for some AG. */
> +int
> +xfs_repair_rmapbt(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct list_head		rmap_records;
> +	xfs_extlen_t			new_btreeblks;
> +	int				log_flags = 0;
> +	int				error;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +
> +	/* Collect rmaps for all AG headers. */
> +	INIT_LIST_HEAD(&rmap_records);
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
> +	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
> +	if (error)
> +		goto out;

And we drop out here.

So, essentially, any ENOMEM, locked inode or inode in reclaim
anywhere in the filesystem will prevent rmap rebuild. Which says to
me that rebuilding the rmap on any substantial filesystem is likely
to fail.

Which brings me back to my original question: why attempt to do
rmap rebuild online given how complex it is, the performance
implications of a full filesystem scan per AG that needs rebuild,
and all the ways it could easily fail?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx