On Thu, Apr 07, 2022 at 01:47:02PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> As mentioned in the previous commit, the kernel misuses sb_frextents in
> the incore mount to reflect both incore reservations made by running
> transactions as well as the actual count of free rt extents on disk.
> This results in the superblock being written to the log with an
> underestimate of the number of rt extents that are marked free in the
> rtbitmap.
> 
> Teaching XFS to recompute frextents after log recovery avoids
> operational problems in the current mount, but it doesn't solve the
> problem of us writing undercounted frextents which are then recovered
> by an older kernel that doesn't have that fix.
> 
> Create an incore percpu counter to mirror the ondisk frextents.  This
> new counter will track transaction reservations, and the only time we
> will touch the incore super counter (i.e. the one that gets logged) is
> when those transactions commit updates to the rt bitmap.  This is in
> contrast to the lazysbcount counters (e.g. fdblocks), where we know
> that log recovery will always fix any incorrect counter that we log.
> As a bonus, we only take m_sb_lock at transaction commit time.

Again, the concept looks fine, as does most of the code. Some comments
on the implementation below.

> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index c5f153c3693f..d5463728c305 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1183,17 +1183,37 @@ xfs_mod_frextents(
>  	struct xfs_mount	*mp,
>  	int64_t			delta)
>  {
> -	int64_t			lcounter;
> -	int			ret = 0;
> +	int			batch;
>  
> -	spin_lock(&mp->m_sb_lock);
> -	lcounter = mp->m_sb.sb_frextents + delta;
> -	if (lcounter < 0)
> -		ret = -ENOSPC;
> +	if (delta > 0) {
> +		percpu_counter_add(&mp->m_frextents, delta);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Taking blocks away, need to be more accurate the closer we
> +	 * are to zero.
> +	 *
> +	 * If the counter has a value of less than 2 * max batch size,
> +	 * then make everything serialise as we are real close to
> +	 * ENOSPC.
> +	 */
> +	if (__percpu_counter_compare(&mp->m_frextents, 2 * XFS_FDBLOCKS_BATCH,
> +				     XFS_FDBLOCKS_BATCH) < 0)
> +		batch = 1;
>  	else
> -		mp->m_sb.sb_frextents = lcounter;
> -	spin_unlock(&mp->m_sb_lock);
> -	return ret;
> +		batch = XFS_FDBLOCKS_BATCH;
> +
> +	percpu_counter_add_batch(&mp->m_frextents, delta, batch);
> +	if (__percpu_counter_compare(&mp->m_frextents, 0,
> +				     XFS_FDBLOCKS_BATCH) >= 0) {
> +		/* we had space! */
> +		return 0;
> +	}
> +
> +	/* oops, negative free space, put that back! */
> +	percpu_counter_add(&mp->m_frextents, -delta);
> +	return -ENOSPC;
>  }

Ok, so this looks like a copy-pasta of xfs_mod_fdblocks() with the
reservation pool stuff stripped. I'd kinda prefer to factor
xfs_mod_fdblocks() so that we aren't blowing out the instruction cache
footprint on this hot path - we're frequently modifying both RT and fd
block counts in the same transaction, so having them run the same code
would be good. Something like:

int
xfs_mod_blocks(
	struct xfs_mount	*mp,
	struct percpu_counter	*pccnt,
	int64_t			delta,
	bool			use_resv_pool,
	bool			rsvd)
{
	int64_t			lcounter;
	long long		res_used;
	s32			batch;
	uint64_t		set_aside = 0;

	if (delta > 0) {
		/*
		 * If the reserve pool is depleted, put blocks back into it
		 * first. Most of the time the pool is full.
		 */
		if (likely(!use_resv_pool ||
			   mp->m_resblks == mp->m_resblks_avail)) {
			percpu_counter_add(pccnt, delta);
			return 0;
		}

		spin_lock(&mp->m_sb_lock);
		res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);

		if (res_used > delta) {
			mp->m_resblks_avail += delta;
		} else {
			delta -= res_used;
			mp->m_resblks_avail = mp->m_resblks;
			percpu_counter_add(pccnt, delta);
		}
		spin_unlock(&mp->m_sb_lock);
		return 0;
	}

	/*
	 * Taking blocks away, need to be more accurate the closer we
	 * are to zero.
	 *
	 * If the counter has a value of less than 2 * max batch size,
	 * then make everything serialise as we are real close to
	 * ENOSPC.
	 */
	if (__percpu_counter_compare(pccnt, 2 * XFS_FDBLOCKS_BATCH,
				     XFS_FDBLOCKS_BATCH) < 0)
		batch = 1;
	else
		batch = XFS_FDBLOCKS_BATCH;

	/*
	 * Set aside allocbt blocks because these blocks are tracked as free
	 * space but not available for allocation. Technically this means
	 * that a single reservation cannot consume all remaining free
	 * space, but the ratio of allocbt blocks to usable free blocks
	 * should be rather small. The tradeoff without this is that
	 * filesystems that maintain high perag block reservations can over
	 * reserve physical block availability and fail physical allocation,
	 * which leads to much more serious problems (i.e. transaction
	 * abort, pagecache discards, etc.) than slightly premature -ENOSPC.
	 */
	if (use_resv_pool)
		set_aside = xfs_fdblocks_unavailable(mp);

	percpu_counter_add_batch(pccnt, delta, batch);
	if (__percpu_counter_compare(pccnt, set_aside,
				     XFS_FDBLOCKS_BATCH) >= 0) {
		/* we had space! */
		return 0;
	}

	/*
	 * lock up the sb for dipping into reserves before releasing the
	 * space that took us to ENOSPC.
	 */
	spin_lock(&mp->m_sb_lock);
	percpu_counter_add(pccnt, -delta);
	if (!use_resv_pool || !rsvd)
		goto fdblocks_enospc;

	lcounter = (long long)mp->m_resblks_avail + delta;
	if (lcounter >= 0) {
		mp->m_resblks_avail = lcounter;
		spin_unlock(&mp->m_sb_lock);
		return 0;
	}
	xfs_warn_once(mp,
"Reserve blocks depleted! Consider increasing reserve pool size.");

fdblocks_enospc:
	spin_unlock(&mp->m_sb_lock);
	return -ENOSPC;
}

And in the relevant header file:

int xfs_mod_blocks(struct xfs_mount *mp, struct percpu_counter *pccnt,
		int64_t delta, bool use_resv_pool, bool rsvd);

static inline int
xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta, bool rsvd)
{
	return xfs_mod_blocks(mp, &mp->m_fdblocks, delta, true, rsvd);
}

static inline int
xfs_mod_frextents(struct xfs_mount *mp, int64_t delta)
{
	return xfs_mod_blocks(mp, &mp->m_frextents, delta, false, false);
}

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 54be9d64093e..cc95768eb8e1 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -843,9 +843,11 @@ xfs_fs_statfs(
>  
>  	if (XFS_IS_REALTIME_MOUNT(mp) &&
>  	    (ip->i_diflags & (XFS_DIFLAG_RTINHERIT | XFS_DIFLAG_REALTIME))) {
> +		s64	freertx;
> +
>  		statp->f_blocks = sbp->sb_rblocks;
> -		statp->f_bavail = statp->f_bfree =
> -			sbp->sb_frextents * sbp->sb_rextsize;
> +		freertx = max_t(s64, 0, percpu_counter_sum(&mp->m_frextents));

percpu_counter_sum_positive()
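i.e. something like this (untested sketch) instead of the open-coded
max_t() clamp:

		freertx = percpu_counter_sum_positive(&mp->m_frextents);

percpu_counter_sum_positive() already clamps the summed value to zero
internally, so the negative-sum case is handled for us.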
>  	if (error)
>  		goto free_fdblocks;
>  
> +	error = percpu_counter_init(&mp->m_frextents, 0, GFP_KERNEL);
> +	if (error)
> +		goto free_delalloc;
> +
>  	return 0;
>  
> +free_delalloc:
> +	percpu_counter_destroy(&mp->m_delalloc_blks);
>  free_fdblocks:
>  	percpu_counter_destroy(&mp->m_fdblocks);
>  free_ifree:
> @@ -1033,6 +1041,7 @@ xfs_reinit_percpu_counters(
>  	percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
>  	percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
>  	percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
> +	percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
>  }
>  
>  static void
> @@ -1045,6 +1054,7 @@ xfs_destroy_percpu_counters(
>  	ASSERT(xfs_is_shutdown(mp) ||
>  	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
>  	percpu_counter_destroy(&mp->m_delalloc_blks);
> +	percpu_counter_destroy(&mp->m_frextents);
>  }
>  
>  static int
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 0ac717aad380..63a4d3a24340 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
>  		be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
>  	}
>  
> -	if (tp->t_frextents_delta)
> -		be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
> -	if (tp->t_res_frextents_delta)
> -		be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
> +	/*
> +	 * Updating frextents requires careful handling because it does not
> +	 * behave like the lazysb counters because we cannot rely on log
> +	 * recovery in older kernels to recompute the value from the
> +	 * rtbitmap.  This means that the ondisk frextents must be
> +	 * consistent with the rtbitmap.
> +	 *
> +	 * Therefore, log the frextents change to the ondisk superblock and
> +	 * update the incore superblock so that future calls to xfs_log_sb
> +	 * write the correct value ondisk.
> +	 *
> +	 * Don't touch m_frextents because it includes incore reservations,
> +	 * and those are handled by the unreserve function.
> +	 */
> +	if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
> +		struct xfs_mount	*mp = tp->t_mountp;
> +		int64_t			rtxdelta;
> +
> +		rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
> +
> +		spin_lock(&mp->m_sb_lock);
> +		be64_add_cpu(&sbp->sb_frextents, rtxdelta);
> +		mp->m_sb.sb_frextents += rtxdelta;
> +		spin_unlock(&mp->m_sb_lock);
> +	}

Hmmmm. This wants a comment in xfs_log_sb() to explain why we aren't
updating mp->m_sb.sb_frextents from mp->m_frextents like we do with all
the other per-cpu counters tracking resource usage.
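Perhaps something along these lines (wording is just a suggestion):

	/*
	 * sb_frextents was updated at transaction commit time and is
	 * always consistent with the rtbitmap, so don't resync it from
	 * the m_frextents percpu counter here.  m_frextents includes
	 * incore reservations that haven't been committed to the rt
	 * bitmap yet.
	 */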
>  
>  	if (tp->t_dblocks_delta) {
>  		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> @@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
>  	if (ifreedelta)
>  		percpu_counter_add(&mp->m_ifree, ifreedelta);
>  
> -	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> +	if (rtxdelta) {
> +		error = xfs_mod_frextents(mp, rtxdelta);
> +		ASSERT(!error);
> +	}
> +
> +	if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
>  		return;
>  
>  	/* apply remaining deltas */
> @@ -622,7 +648,6 @@
>  	mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
>  	mp->m_sb.sb_icount += idelta;
>  	mp->m_sb.sb_ifree += ifreedelta;
> -	mp->m_sb.sb_frextents += rtxdelta;

This makes my head hurt trying to work out if this is necessary or not.
(The lazy sb stuff in these functions has always strained my cognitive
abilities, even though I wrote it in the first place!)

A comment explaining why we don't need to update mp->m_sb.sb_frextents
when XFS_TRANS_SB_DIRTY is set would be useful in the if (rtxdelta)
update above.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx