On Thu, Apr 07, 2022 at 01:47:02PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> As mentioned in the previous commit, the kernel misuses sb_frextents in
> the incore mount to reflect both incore reservations made by running
> transactions as well as the actual count of free rt extents on disk.
> This results in the superblock being written to the log with an
> underestimate of the number of rt extents that are marked free in the
> rtbitmap.
> 
> Teaching XFS to recompute frextents after log recovery avoids
> operational problems in the current mount, but it doesn't solve the
> problem of us writing undercounted frextents which are then recovered
> by an older kernel that doesn't have that fix.
> 
> Create an incore percpu counter to mirror the ondisk frextents.  This
> new counter will track transaction reservations, and the only time we
> will touch the incore super counter (i.e. the one that gets logged) is
> when those transactions commit updates to the rt bitmap.  This is in
> contrast to the lazysbcount counters (e.g. fdblocks), where we know
> that log recovery will always fix any incorrect counter that we log.
> As a bonus, we only take m_sb_lock at transaction commit time.

Again, the concept looks fine, as does most of the code. Some comments
on the implementation below.

> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index c5f153c3693f..d5463728c305 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1183,17 +1183,37 @@ xfs_mod_frextents(
>  	struct xfs_mount	*mp,
>  	int64_t			delta)
>  {
> -	int64_t			lcounter;
> -	int			ret = 0;
> +	int			batch;
>  
> -	spin_lock(&mp->m_sb_lock);
> -	lcounter = mp->m_sb.sb_frextents + delta;
> -	if (lcounter < 0)
> -		ret = -ENOSPC;
> +	if (delta > 0) {
> +		percpu_counter_add(&mp->m_frextents, delta);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Taking blocks away, need to be more accurate the closer we
> +	 * are to zero.
> +	 *
> +	 * If the counter has a value of less than 2 * max batch size,
> +	 * then make everything serialise as we are real close to
> +	 * ENOSPC.
> +	 */
> +	if (__percpu_counter_compare(&mp->m_frextents, 2 * XFS_FDBLOCKS_BATCH,
> +				     XFS_FDBLOCKS_BATCH) < 0)
> +		batch = 1;
>  	else
> -		mp->m_sb.sb_frextents = lcounter;
> -	spin_unlock(&mp->m_sb_lock);
> -	return ret;
> +		batch = XFS_FDBLOCKS_BATCH;
> +
> +	percpu_counter_add_batch(&mp->m_frextents, delta, batch);
> +	if (__percpu_counter_compare(&mp->m_frextents, 0,
> +				     XFS_FDBLOCKS_BATCH) >= 0) {
> +		/* we had space! */
> +		return 0;
> +	}
> +
> +	/* oops, negative free space, put that back! */
> +	percpu_counter_add(&mp->m_frextents, -delta);
> +	return -ENOSPC;
>  }

Ok, so this looks like a copy-pasta of xfs_mod_fdblocks() with the
reservation pool stuff stripped. I'd kinda prefer to factor
xfs_mod_fdblocks() so that we aren't blowing out the instruction cache
footprint on this hot path - we're frequently modifying both RT and fd
block counts in the same transaction, so having them run the same code
would be good. Something like:

int
xfs_mod_blocks(
	struct xfs_mount	*mp,
	struct percpu_counter	*pccnt,
	int64_t			delta,
	bool			use_resv_pool,
	bool			rsvd)
{
	int64_t			lcounter;
	long long		res_used;
	s32			batch;
	uint64_t		set_aside = 0;

	if (delta > 0) {
		/*
		 * If the reserve pool is depleted, put blocks back into it
		 * first. Most of the time the pool is full.
		 */
		if (likely(!use_resv_pool ||
			   mp->m_resblks == mp->m_resblks_avail)) {
			percpu_counter_add(pccnt, delta);
			return 0;
		}

		spin_lock(&mp->m_sb_lock);
		res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);

		if (res_used > delta) {
			mp->m_resblks_avail += delta;
		} else {
			delta -= res_used;
			mp->m_resblks_avail = mp->m_resblks;
			percpu_counter_add(pccnt, delta);
		}
		spin_unlock(&mp->m_sb_lock);
		return 0;
	}

	/*
	 * Taking blocks away, need to be more accurate the closer we
	 * are to zero.
	 *
	 * If the counter has a value of less than 2 * max batch size,
	 * then make everything serialise as we are real close to
	 * ENOSPC.
	 */
	if (__percpu_counter_compare(pccnt, 2 * XFS_FDBLOCKS_BATCH,
				     XFS_FDBLOCKS_BATCH) < 0)
		batch = 1;
	else
		batch = XFS_FDBLOCKS_BATCH;

	/*
	 * Set aside allocbt blocks because these blocks are tracked as free
	 * space but not available for allocation. Technically this means
	 * that a single reservation cannot consume all remaining free
	 * space, but the ratio of allocbt blocks to usable free blocks
	 * should be rather small. The tradeoff without this is that
	 * filesystems that maintain high perag block reservations can over
	 * reserve physical block availability and fail physical allocation,
	 * which leads to much more serious problems (i.e. transaction
	 * abort, pagecache discards, etc.) than slightly premature -ENOSPC.
	 */
	if (use_resv_pool)
		set_aside = xfs_fdblocks_unavailable(mp);

	percpu_counter_add_batch(pccnt, delta, batch);
	if (__percpu_counter_compare(pccnt, set_aside,
				     XFS_FDBLOCKS_BATCH) >= 0) {
		/* we had space! */
		return 0;
	}

	/*
	 * lock up the sb for dipping into reserves before releasing the
	 * space that took us to ENOSPC.
	 */
	spin_lock(&mp->m_sb_lock);
	percpu_counter_add(pccnt, -delta);
	if (!use_resv_pool || !rsvd)
		goto fdblocks_enospc;

	lcounter = (long long)mp->m_resblks_avail + delta;
	if (lcounter >= 0) {
		mp->m_resblks_avail = lcounter;
		spin_unlock(&mp->m_sb_lock);
		return 0;
	}
	xfs_warn_once(mp,
"Reserve blocks depleted! Consider increasing reserve pool size.");

fdblocks_enospc:
	spin_unlock(&mp->m_sb_lock);
	return -ENOSPC;
}

And in the relevant header file:

int xfs_mod_blocks(struct xfs_mount *mp, struct percpu_counter *pccnt,
		int64_t delta, bool use_resv_pool, bool rsvd);

static inline int
xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta, bool rsvd)
{
	return xfs_mod_blocks(mp, &mp->m_fdblocks, delta, true, rsvd);
}

static inline int
xfs_mod_frextents(struct xfs_mount *mp, int64_t delta)
{
	return xfs_mod_blocks(mp, &mp->m_frextents, delta, false, false);
}

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 54be9d64093e..cc95768eb8e1 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -843,9 +843,11 @@ xfs_fs_statfs(
>  
>  	if (XFS_IS_REALTIME_MOUNT(mp) &&
>  	    (ip->i_diflags & (XFS_DIFLAG_RTINHERIT | XFS_DIFLAG_REALTIME))) {
> +		s64	freertx;
> +
>  		statp->f_blocks = sbp->sb_rblocks;
> -		statp->f_bavail = statp->f_bfree =
> -			sbp->sb_frextents * sbp->sb_rextsize;
> +		freertx = max_t(s64, 0, percpu_counter_sum(&mp->m_frextents));

percpu_counter_sum_positive()
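i.e. something like this (untested sketch) instead of the open-coded
max_t() clamp:

		freertx = percpu_counter_sum_positive(&mp->m_frextents);

percpu_counter_sum_positive() already clamps the summed value to zero
internally, so the negative-sum case is handled for us.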
>  	if (error)
>  		goto free_fdblocks;
>  
> +	error = percpu_counter_init(&mp->m_frextents, 0, GFP_KERNEL);
> +	if (error)
> +		goto free_delalloc;
> +
>  	return 0;
>  
> +free_delalloc:
> +	percpu_counter_destroy(&mp->m_delalloc_blks);
>  free_fdblocks:
>  	percpu_counter_destroy(&mp->m_fdblocks);
>  free_ifree:
> @@ -1033,6 +1041,7 @@ xfs_reinit_percpu_counters(
>  	percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
>  	percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
>  	percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
> +	percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
>  }
>  
>  static void
> @@ -1045,6 +1054,7 @@ xfs_destroy_percpu_counters(
>  	ASSERT(xfs_is_shutdown(mp) ||
>  	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
>  	percpu_counter_destroy(&mp->m_delalloc_blks);
> +	percpu_counter_destroy(&mp->m_frextents);
>  }
>  
>  static int
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 0ac717aad380..63a4d3a24340 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
>  		be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
>  	}
>  
> -	if (tp->t_frextents_delta)
> -		be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
> -	if (tp->t_res_frextents_delta)
> -		be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
> +	/*
> +	 * Updating frextents requires careful handling because it does not
> +	 * behave like the lazysb counters because we cannot rely on log
> +	 * recovery in older kernels to recompute the value from the
> +	 * rtbitmap.  This means that the ondisk frextents must be
> +	 * consistent with the rtbitmap.
> +	 *
> +	 * Therefore, log the frextents change to the ondisk superblock and
> +	 * update the incore superblock so that future calls to xfs_log_sb
> +	 * write the correct value ondisk.
> +	 *
> +	 * Don't touch m_frextents because it includes incore reservations,
> +	 * and those are handled by the unreserve function.
> +	 */
> +	if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
> +		struct xfs_mount	*mp = tp->t_mountp;
> +		int64_t			rtxdelta;
> +
> +		rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
> +
> +		spin_lock(&mp->m_sb_lock);
> +		be64_add_cpu(&sbp->sb_frextents, rtxdelta);
> +		mp->m_sb.sb_frextents += rtxdelta;
> +		spin_unlock(&mp->m_sb_lock);
> +	}

Hmmmm. This wants a comment in xfs_log_sb() to explain why we aren't
updating mp->m_sb.sb_frextents from mp->m_frextents like we do with all
the other per-cpu counters tracking resource usage.
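Perhaps something along these lines (wording is just a suggestion):

	/*
	 * sb_frextents was updated at transaction commit time and is
	 * always consistent with the rtbitmap, so don't resync it from
	 * the m_frextents percpu counter here.  m_frextents includes
	 * incore reservations that haven't been committed to the rt
	 * bitmap yet.
	 */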
>  
>  	if (tp->t_dblocks_delta) {
>  		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> @@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
>  	if (ifreedelta)
>  		percpu_counter_add(&mp->m_ifree, ifreedelta);
>  
> -	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> +	if (rtxdelta) {
> +		error = xfs_mod_frextents(mp, rtxdelta);
> +		ASSERT(!error);
> +	}
> +
> +	if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
>  		return;
>  
>  	/* apply remaining deltas */
> @@ -622,7 +648,6 @@
>  	mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
>  	mp->m_sb.sb_icount += idelta;
>  	mp->m_sb.sb_ifree += ifreedelta;
> -	mp->m_sb.sb_frextents += rtxdelta;

This makes my head hurt trying to work out if this is necessary or not.
(The lazy sb stuff in these functions has always strained my cognitive
abilities, even though I wrote it in the first place!)

A comment explaining why we don't need to update mp->m_sb.sb_frextents
when XFS_TRANS_SB_DIRTY is set would be useful in the if (rtxdelta)
update above.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx