On Fri, Apr 08, 2022 at 09:17:25AM +1000, Dave Chinner wrote:
> On Thu, Apr 07, 2022 at 01:47:02PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > As mentioned in the previous commit, the kernel misuses sb_frextents in
> > the incore mount to reflect both incore reservations made by running
> > transactions as well as the actual count of free rt extents on disk.
> > This results in the superblock being written to the log with an
> > underestimate of the number of rt extents that are marked free in the
> > rtbitmap.
> > 
> > Teaching XFS to recompute frextents after log recovery avoids
> > operational problems in the current mount, but it doesn't solve the
> > problem of us writing undercounted frextents which are then recovered by
> > an older kernel that doesn't have that fix.
> > 
> > Create an incore percpu counter to mirror the ondisk frextents. This
> > new counter will track transaction reservations and the only time we
> > will touch the incore super counter (i.e. the one that gets logged) is
> > when those transactions commit updates to the rt bitmap. This is in
> > contrast to the lazysbcount counters (e.g. fdblocks), where we know that
> > log recovery will always fix any incorrect counter that we log.
> As a bonus, we only take m_sb_lock at transaction commit time.
> 
> Again, the concept looks fine as does most of the code. Some
> comments on the implementation below.
> 
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index c5f153c3693f..d5463728c305 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -1183,17 +1183,37 @@ xfs_mod_frextents(
> >  	struct xfs_mount	*mp,
> >  	int64_t			delta)
> >  {
> > -	int64_t			lcounter;
> > -	int			ret = 0;
> > +	int			batch;
> > 
> > -	spin_lock(&mp->m_sb_lock);
> > -	lcounter = mp->m_sb.sb_frextents + delta;
> > -	if (lcounter < 0)
> > -		ret = -ENOSPC;
> > +	if (delta > 0) {
> > +		percpu_counter_add(&mp->m_frextents, delta);
> > +		return 0;
> > +	}
> > +
> > +	/*
> > +	 * Taking blocks away, need to be more accurate the closer we
> > +	 * are to zero.
> > +	 *
> > +	 * If the counter has a value of less than 2 * max batch size,
> > +	 * then make everything serialise as we are real close to
> > +	 * ENOSPC.
> > +	 */
> > +	if (__percpu_counter_compare(&mp->m_frextents, 2 * XFS_FDBLOCKS_BATCH,
> > +				     XFS_FDBLOCKS_BATCH) < 0)
> > +		batch = 1;
> >  	else
> > -		mp->m_sb.sb_frextents = lcounter;
> > -	spin_unlock(&mp->m_sb_lock);
> > -	return ret;
> > +		batch = XFS_FDBLOCKS_BATCH;
> > +
> > +	percpu_counter_add_batch(&mp->m_frextents, delta, batch);
> > +	if (__percpu_counter_compare(&mp->m_frextents, 0,
> > +				     XFS_FDBLOCKS_BATCH) >= 0) {
> > +		/* we had space! */
> > +		return 0;
> > +	}
> > +
> > +	/* oops, negative free space, put that back! */
> > +	percpu_counter_add(&mp->m_frextents, -delta);
> > +	return -ENOSPC;
> >  }
> 
> Ok, so this looks like a copy-pasta of xfs_mod_fdblocks() with the
> reservation pool stuff stripped. I'd kinda prefer to factor
> xfs_mod_fdblocks() so that we aren't blowing out the instruction
> cache footprint on this hot path - we're frequently modifying
> both RT and fd block counts in the same transaction, so having them
> run the same code would be good.
> 
> Something like:
> 
> int
> xfs_mod_blocks(
> 	struct xfs_mount	*mp,
> 	struct percpu_counter	*pccnt,
> 	int64_t			delta,
> 	bool			use_resv_pool,
> 	bool			rsvd)
> {
> 	int64_t			lcounter;
> 	long long		res_used;
> 	s32			batch;
> 	uint64_t		set_aside = 0;
> 
> 	if (delta > 0) {
> 		/*
> 		 * If the reserve pool is depleted, put blocks back into it
> 		 * first. Most of the time the pool is full.
> 		 */
> 		if (likely(!use_resv_pool ||
> 			   mp->m_resblks == mp->m_resblks_avail)) {
> 			percpu_counter_add(pccnt, delta);
> 			return 0;
> 		}
> 
> 		spin_lock(&mp->m_sb_lock);
> 		res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);
> 
> 		if (res_used > delta) {
> 			mp->m_resblks_avail += delta;
> 		} else {
> 			delta -= res_used;
> 			mp->m_resblks_avail = mp->m_resblks;
> 			percpu_counter_add(pccnt, delta);
> 		}
> 		spin_unlock(&mp->m_sb_lock);
> 		return 0;
> 	}
> 
> 	/*
> 	 * Taking blocks away, need to be more accurate the closer we
> 	 * are to zero.
> 	 *
> 	 * If the counter has a value of less than 2 * max batch size,
> 	 * then make everything serialise as we are real close to
> 	 * ENOSPC.
> 	 */
> 	if (__percpu_counter_compare(pccnt, 2 * XFS_FDBLOCKS_BATCH,
> 				     XFS_FDBLOCKS_BATCH) < 0)
> 		batch = 1;
> 	else
> 		batch = XFS_FDBLOCKS_BATCH;
> 
> 	/*
> 	 * Set aside allocbt blocks because these blocks are tracked as free
> 	 * space but not available for allocation. Technically this means that
> 	 * a single reservation cannot consume all remaining free space, but
> 	 * the ratio of allocbt blocks to usable free blocks should be rather
> 	 * small. The tradeoff without this is that filesystems that maintain
> 	 * high perag block reservations can over reserve physical block
> 	 * availability and fail physical allocation, which leads to much more
> 	 * serious problems (i.e. transaction abort, pagecache discards, etc.)
> 	 * than slightly premature -ENOSPC.
> 	 */
> 	if (use_resv_pool)
> 		set_aside = xfs_fdblocks_unavailable(mp);
> 	percpu_counter_add_batch(pccnt, delta, batch);
> 	if (__percpu_counter_compare(pccnt, set_aside,
> 				     XFS_FDBLOCKS_BATCH) >= 0) {
> 		/* we had space! */
> 		return 0;
> 	}
> 
> 	/*
> 	 * lock up the sb for dipping into reserves before releasing the space
> 	 * that took us to ENOSPC.
> 	 */
> 	spin_lock(&mp->m_sb_lock);
> 	percpu_counter_add(pccnt, -delta);
> 	if (!use_resv_pool || !rsvd)
> 		goto fdblocks_enospc;
> 
> 	lcounter = (long long)mp->m_resblks_avail + delta;
> 	if (lcounter >= 0) {
> 		mp->m_resblks_avail = lcounter;
> 		spin_unlock(&mp->m_sb_lock);
> 		return 0;
> 	}
> 	xfs_warn_once(mp,
> "Reserve blocks depleted! Consider increasing reserve pool size.");
> 
> fdblocks_enospc:
> 	spin_unlock(&mp->m_sb_lock);
> 	return -ENOSPC;
> }
> 
> And in the relevant header file:
> 
> int xfs_mod_blocks(struct xfs_mount *mp, struct percpu_counter *pccnt,
> 		int64_t delta, bool use_resv_pool, bool rsvd);
> 
> static inline int
> xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta, bool rsvd)
> {
> 	return xfs_mod_blocks(mp, &mp->m_fdblocks, delta, true, rsvd);
> }
> 
> static inline int
> xfs_mod_frextents(struct xfs_mount *mp, int64_t delta)
> {
> 	return xfs_mod_blocks(mp, &mp->m_frextents, delta, false, false);
> }

Looks good to me.

> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 54be9d64093e..cc95768eb8e1 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -843,9 +843,11 @@ xfs_fs_statfs(
> > 
> >  	if (XFS_IS_REALTIME_MOUNT(mp) &&
> >  	    (ip->i_diflags & (XFS_DIFLAG_RTINHERIT | XFS_DIFLAG_REALTIME))) {
> > +		s64	freertx;
> > +
> >  		statp->f_blocks = sbp->sb_rblocks;
> > -		statp->f_bavail = statp->f_bfree =
> > -			sbp->sb_frextents * sbp->sb_rextsize;
> > +		freertx = max_t(s64, 0, percpu_counter_sum(&mp->m_frextents));
> 
> 	percpu_counter_sum_positive()

Ok.
> >  	if (error)
> >  		goto free_fdblocks;
> > 
> > +	error = percpu_counter_init(&mp->m_frextents, 0, GFP_KERNEL);
> > +	if (error)
> > +		goto free_delalloc;
> > +
> >  	return 0;
> > 
> > +free_delalloc:
> > +	percpu_counter_destroy(&mp->m_delalloc_blks);
> >  free_fdblocks:
> >  	percpu_counter_destroy(&mp->m_fdblocks);
> >  free_ifree:
> > @@ -1033,6 +1041,7 @@ xfs_reinit_percpu_counters(
> >  	percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
> >  	percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
> >  	percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
> > +	percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
> >  }
> > 
> >  static void
> > @@ -1045,6 +1054,7 @@ xfs_destroy_percpu_counters(
> >  	ASSERT(xfs_is_shutdown(mp) ||
> >  	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
> >  	percpu_counter_destroy(&mp->m_delalloc_blks);
> > +	percpu_counter_destroy(&mp->m_frextents);
> >  }
> > 
> >  static int
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 0ac717aad380..63a4d3a24340 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
> >  		be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
> >  	}
> > 
> > -	if (tp->t_frextents_delta)
> > -		be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
> > -	if (tp->t_res_frextents_delta)
> > -		be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
> > +	/*
> > +	 * Updating frextents requires careful handling because it does not
> > +	 * behave like the lazysb counters, since we cannot rely on log
> > +	 * recovery in older kernels to recompute the value from the rtbitmap.
> > +	 * This means that the ondisk frextents must be consistent with the
> > +	 * rtbitmap.
> > +	 *
> > +	 * Therefore, log the frextents change to the ondisk superblock and
> > +	 * update the incore superblock so that future calls to xfs_log_sb
> > +	 * write the correct value ondisk.
> > +	 *
> > +	 * Don't touch m_frextents because it includes incore reservations,
> > +	 * and those are handled by the unreserve function.
> > +	 */
> > +	if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
> > +		struct xfs_mount	*mp = tp->t_mountp;
> > +		int64_t			rtxdelta;
> > +
> > +		rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
> > +
> > +		spin_lock(&mp->m_sb_lock);
> > +		be64_add_cpu(&sbp->sb_frextents, rtxdelta);
> > +		mp->m_sb.sb_frextents += rtxdelta;
> > +		spin_unlock(&mp->m_sb_lock);
> > +	}
> 
> Hmmmm. This wants a comment in xfs_log_sb() to explain why we
> aren't updating mp->m_sb.sb_frextents from mp->m_frextents like we
> do with all the other per-cpu counters tracking resource usage.

Ok.  How about this for xfs_log_sb:

	/*
	 * Do not update sb_frextents here because it is not part of the lazy
	 * sb counters (despite having a percpu counter) and therefore must be
	 * consistent with the ondisk rtbitmap.
	 */

> >  	if (tp->t_dblocks_delta) {
> >  		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> > @@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
> >  	if (ifreedelta)
> >  		percpu_counter_add(&mp->m_ifree, ifreedelta);
> > 
> > -	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> > +	if (rtxdelta) {
> > +		error = xfs_mod_frextents(mp, rtxdelta);
> > +		ASSERT(!error);
> > +	}
> > +
> > +	if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
> >  		return;
> > 
> >  	/* apply remaining deltas */
> > @@ -622,7 +648,6 @@ xfs_trans_unreserve_and_mod_sb(
> >  	mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
> >  	mp->m_sb.sb_icount += idelta;
> >  	mp->m_sb.sb_ifree += ifreedelta;
> > -	mp->m_sb.sb_frextents += rtxdelta;
> 
> This makes my head hurt trying to work out if this is necessary or
> not. (the lazy sb stuff in these functions has always strained my
> cognitive abilities, even though I wrote it in the first place!)
> 
> A comment explaining why we don't need to update
> mp->m_sb.sb_frextents when XFS_TRANS_SB_DIRTY is set would be useful
> in the if (rtxdelta) update above.

And how about this?

	/*
	 * Do not touch sb_frextents here because we are dealing with incore
	 * reservation.  sb_frextents is not part of the lazy sb counters so it
	 * must be consistent with the ondisk rtbitmap and must never include
	 * incore reservations.
	 */

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx