Re: [PATCH 1/2] xfs: recalculate free rt extents after log recovery

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 8 Apr 2022 10:42:21 -0700

On Fri, Apr 08, 2022 at 10:06:05AM +1000, Dave Chinner wrote:
> On Thu, Apr 07, 2022 at 04:39:41PM -0700, Darrick J. Wong wrote:
> > On Fri, Apr 08, 2022 at 07:56:05AM +1000, Dave Chinner wrote:
> > > On Thu, Apr 07, 2022 at 01:46:56PM -0700, Darrick J. Wong wrote:
> > > > @@ -1284,6 +1285,43 @@ xfs_rtmount_init(
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static inline int
> > > > +xfs_rtalloc_count_frextent(
> > > > +	struct xfs_trans		*tp,
> > > > +	const struct xfs_rtalloc_rec	*rec,
> > > > +	void				*priv)
> > > > +{
> > > > +	uint64_t			*valp = priv;
> > > > +
> > > > +	*valp += rec->ar_extcount;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/* Reinitialize the number of free realtime extents from the realtime bitmap. */
> > > > +STATIC int
> > > > +xfs_rtalloc_reinit_frextents(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	struct xfs_trans	*tp;
> > > > +	uint64_t		val = 0;
> > > > +	int			error;
> > > > +
> > > > +	error = xfs_trans_alloc_empty(mp, &tp);
> > > > +	if (error)
> > > > +		return error;
> > > 
> > > At this point in the mount, the transaction subsystem is not
> > > really available for use. I understand that this is an empty
> > > transaction and it doesn't reserve log space, but I don't like
> > > creating an implicit requirement that xfs_trans_alloc_empty() never
> > > touches log state for this code to work correctly into the future.
> > 
> > <nod> I'd wondered if it was really a good idea to be creating a
> > transaction at the "end" of log recovery.  Recovery itself does it, but
> > at least that's a carefully controlled box...
> > 
> > > That is, the log isn't in a state where we can run normal
> > > transactions because we can still have pending intent recovery
> > > queued in the AIL. These can't be moved from the tail of the AIL
> > > until xfs_log_mount_finish() is called to remove and process them.
> > > They are dependent on there being enough log space remaining for the
> > > transaction reservation they make to succeed and there may only be
> > > just enough space for the first intent to make that reservation.
> > > Hence if we consume any log space before we recover intents, we can
> > > deadlock on log space when then trying to recover the intents. And
> > > because the AIL can't push on intents, the same reservation deadlock
> > > could occur with any transaction reservation we try to run before
> > > xfs_log_mount_finish() has been run.
> > > 
> > > Yes, I know this is an empty transaction and so it doesn't push on
> > > the log or consume any space. But I think it's a bad precedent even
> > > to use transactions at all when we know we are in the middle of log
> > > recovery and the transaction subsystem is not open for general use
> > > yet.  Hence I think using transaction handles - even empty ones -
> > > before xfs_log_mount_finish() is run is a pattern we want to avoid
> > > if at all possible.
> > 
> > ...yes...
> > 
> > > > +
> > > > +	xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL);
> > > > +	error = xfs_rtalloc_query_all(tp, xfs_rtalloc_count_frextent, &val);
> > > 
> > > Looking at the code here, AFAICT the only thing the tp is absolutely
> > > required for here is to hold the mount pointer - all the internal
> > > bitmap walk operations grab and release buffers using interfaces
> > > that can handle a null tp being passed.
> > > 
> > > Hence it looks to me like a transaction context isn't necessary for
> > > this read-only bitmap walk. Is that correct?  Converting
> > > xfs_rtalloc_query_all() to take a mount and a trans would also make
> > > this code becomes a lot simpler....
> > 
> > ...and I was about to say "Yeah, we could make xfs_rtalloc_query_all
> > take the mount and an optional transaction", but it occurred to me:
> > 
> > The rtalloc queries need to read the rtbitmap file's data fork mappings
> > into memory to find the blocks.  If the rtbitmap is fragmented enough
> > that the data fork is in btree format and the bmbt contains an upward
> > cycle, loading the bmbt will livelock the mount attempt.
> 
> "upward cycle", as in a corrupted btree?

Yes.

> > Granted, xfs_bmapi_read doesn't take a transaction, so this means that
> > we'd either have to change it to take one, or we'd have to change the
> > rtmount code to create an empty transaction and call iread_extents to
> > detect a bmbt cycle.
> 
> 
> > So yeah, I think the answer to your question is "yes", but then there's
> > this other issue... :(
> 
> Yup, but keep in mind that:
> 
> xfs_fs_reserve_ag_blocks(tp == NULL)
>   xfs_ag_resv_init
>     xfs_finobt_calc_reserves
>       xfs_inobt_count_blocks
>         xfs_btree_count_blocks
> 
> Will do a finobt btree walk without a transaction pointer if
> we don't have finobt block counts recorded in the AGI. So it
> seems to me that we are already at risk of cycles in corrupted
> btrees being able to cause mount hangs.
> 
> Hence from that perspective, I think worrying about a corrupted
> rtbitmap BMBT btree is largely outside the scope of this patchset;
> it's a generic issue with transaction-less btree walks, and one we
> are already exposed to in the same phase of the mount/log recovery
> process. Hence I think we should consider how to solve this problem
> seperately rather that tie this patchset into twisty knots trying to
> deal with here...

<nod> I've long wanted to do a full audit of all the places we walk
ondisk graph metadata without a transaction, but still haven't gotten
to it... :(

Anyway, I'll leave /that/ for a future series.  In the meantime, I'll
amend the rtalloc query functions to take mp and tp.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx