On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > xfs_reserve_blocks controls the size of the user-visible free space
> > > reserve pool.  Given the difference between the current and requested
> > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > the amount requested from fdblocks is also constrained by the amount of
> > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > 
> > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > out the "free space" used by the free space btrees, because some portion
> > > of the free space btrees hold in reserve space for future btree
> > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > 
> > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > loop so that it will not retry infinitely.
> > > 
> > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > with an fstests configuration that specifies a 1k blocksize and a
> > > specially crafted log size that will consume 7/8 of the space (17920
> > > blocks, specifically) in that AG.
> > > 
> > > Cc: Brian Foster <bfoster@xxxxxxxxxx>
> > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 33e26690a8c4..78b6982ea5b0 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > >  	int64_t		fdblks_delta = 0;
> > >  	uint64_t	request;
> > >  	int64_t		free;
> > > +	unsigned int	tries;
> > >  	int		error = 0;
> > >  
> > >  	/* If inval is null, report current values and return */
> > > 
> > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > >  	 * perform a partial reservation if the request exceeds free space.
> > >  	 */
> > >  	error = -ENOSPC;
> > > -	do {
> > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > -						mp->m_alloc_set_aside;
> > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > 
> > Any reason for the magic number of retries as opposed to perhaps just
> > not retrying at all?
> I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> block reservation mechanism."), where I guess Dave decided that we
> should loop forever trying to satisfy a request from userspace to
> increase the reserve pool.  OFC you and I have been patching this
> function to fix all its horrible warts over the years, so maybe you're
> right that this should only try once...
> 
> (For the mount time default reservation, we should only iterate the loop
> once (provided the accounting is correct ;) since nobody else is
> touching the free space counters.)
> > This seems a little odd when you think about it
> > given that the request is already intended to take available space into
> > account and modify the request from userspace.  OTOH, another
> > consideration could be to retry some (really large?) number of times and
> > then bail out if we happen to iterate without an observable change in
> > free space (i.e., something is wrong), however I suppose that could be
> > racy as well. *shrug*
> ...but if you're the sysadmin desperately trying to increase the size of
> the reserve pool when the fs is running near ENOSPC, you're going to be
> racing with fdblocks bouncing up and down.  The @free samples that we
> take here in the loop body are indeed racy since we can't tell the
> difference between @free being unchanged from the last iteration because
> someone freed a block and someone else immediately consumed it, or a
> totally idle system.
> 
> Either way, it's better than hanging the whole system. :)
> 

Yeah.. I'm not bothered much by whether we retry once, 42 times or
forever. I think what this boils down to for me is whether it's worth
the risk of a behavior change of an -ENOSPC return causing something
unexpected for some random user or use case. Could we just do this in
two separate patches? Patch 1 fixes the calculation and targets stable,
patch 2 does whatever to the retry loop that potentially changes retry
semantics (and doesn't really need backporting)..?

> What if I augment the loop control with a comment capturing some of this:
> 
> 	/*
> 	 * The loop body estimates how many blocks it can request from
> 	 * fdblocks to stash in the reserve pool.  This is a classic
> 	 * TOCTOU race since fdblocks updates are not always coordinated
> 	 * via m_sb_lock.  We also cannot tell if @free remaining
> 	 * unchanged between iterations is due to an idle system or
> 	 * freed blocks being consumed immediately, so we'll try a
> 	 * finite number of times to satisfy the request.
> 	 */
> 	for (tries = 0; tries < 30...) {
> 
> > 
> > > +		/*
> > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > +		 * will not give us.  This includes the per-AG set-aside space
> > > +		 * and free space btree blocks that are not available for
> > > +		 * allocation due to per-AG metadata reservations.
> > > +		 */
> > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > +		free -= mp->m_alloc_set_aside;
> > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > 
> > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > Not sure how much it matters for space reporting purposes, but if so, it
> > might also be worth reconsidering the usefulness of a static field and
> > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> When I was writing this patch, I very nearly decided to make those three
> lines above their own helper.  I didn't see any other spots that looked
> like obvious candidates for such a calculation outside of statfs.
> 

Indeed..

> Subtracting m_allocbt_blks from statfs' avail field is a behavior
> change, since we always used to consider bnobt blocks as available.  We
> don't have an exact count of how many blocks are needed to hide the
> per-ag reserved extents, so in the end we have to decide whether we want
> to hear complaints about over- or under-estimation of available blocks.
> 
> So I think the statfs stuff is a separate patch. :)
> 

Similar deal as above.. I'm more interested in a potential cleanup of
the code that helps prevent this sort of buglet for the next user of
->m_alloc_set_aside that will (expectedly) have no idea about this
subtle quirk than I am about what's presented in the free space
counters.
ISTM that we ought to just ditch ->m_alloc_set_aside, replace the
existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro or
something that just does the (agcount << 3) thing, and then define a new
xfs_alloc_set_aside() that combines the macro calculation with
->m_allocbt_blks. Then the whole "set aside" concept is calculated and
documented in one place. Hm?

Brian

> --D
> > 
> > Brian
> > 
> > >  		if (free <= 0)
> > >  			break;
> > > 
> > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > >  		spin_unlock(&mp->m_sb_lock);
> > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > >  		spin_lock(&mp->m_sb_lock);
> > > -	} while (error == -ENOSPC);
> > > +	}
> > >  
> > >  	/*
> > >  	 * Update the reserve counters if blocks have been successfully
> > > 
> > 
> 
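
A minimal sketch of the xfs_alloc_set_aside() consolidation Brian
describes above; the XFS_ALLOC_FS_RESERVED() macro body and the
uint64_t return type are assumptions for illustration, not code from
this thread:

	/*
	 * Static per-AG set-aside -- the (agcount << 3) value that the
	 * mount-time m_alloc_set_aside field caches today.
	 */
	#define XFS_ALLOC_FS_RESERVED(mp)	((mp)->m_sb.sb_agcount << 3)

	/*
	 * Space that xfs_mod_fdblocks() will not hand out: the static
	 * per-AG set-aside plus the free space btree blocks currently
	 * held in reserve for the per-AG metadata reservations.
	 */
	uint64_t
	xfs_alloc_set_aside(
		struct xfs_mount	*mp)
	{
		return XFS_ALLOC_FS_RESERVED(mp) +
				atomic64_read(&mp->m_allocbt_blks);
	}

xfs_reserve_blocks() and xfs_fs_statfs() would then call
xfs_alloc_set_aside(mp) instead of reading a cached field, so the next
user of the set-aside cannot forget about ->m_allocbt_blks.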