Re: [PATCH] xfs: don't use in-core per-cpu fdblocks for !lazysbcount

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 17 Apr 2021 11:57:02 +1000

On Fri, Apr 16, 2021 at 05:19:41PM -0700, Darrick J. Wong wrote:
> On Sat, Apr 17, 2021 at 05:13:20AM +0800, Gao Xiang wrote:
> > Hi Darrick,
> > 
> > On Fri, Apr 16, 2021 at 09:00:13AM -0700, Darrick J. Wong wrote:
> > > On Fri, Apr 16, 2021 at 05:10:23PM +0800, Gao Xiang wrote:
.....
> > > Are you saying that because the f46e commit removed the xfs_sync_sb
> > > calls from unmountfs for !lazysb filesystems, we no longer log the
> > > summary counters at unmount?  Which means that we no longer write the
> > > incore percpu fdblocks count to disk at unmount after we've torn down
> > > all the incore space reservations (when sb_fdblocks == m_fdblocks)?
> > 
> > Er.. I think that is by reverse, before commit f46e, we no longer logged
> > the summary counters at unmount, due to 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/xfs_mount.c?h=v5.11#n1177
> >   xfs_unmountfs
> >     -> xfs_log_sbcount
> >       -> !xfs_sb_version_haslazysbcount
> >         -> return 0 (xfs_sync_sb bypassed).
> > 
> > So the only time we update the ondisk fdblocks was during transactions,
> > but xfs_log_sb() corrupted this (due to no summary counters logging at
> > unmount).
> 
> *OH* ok, so this isn't a fix for a regression in Brian's log covering
> refactoring series that went into 5.12; this is a fix for a years old
> bug that may very well have been there since the introduction of ...
> delayed allocation?  I guess?

No, xfs_trans_apply_sb_deltas() modifies the counters in the on disk
superblock buffer directly. These counters don't contain the
in-memory reservations for things like delalloc that are held in the
counters in the xfs_mount (either percpu or mp->m_sb.sb_*).

The whole point of lazy superblock counters was avoiding this disk
buffer update, because it required taking the superblock buffer lock
and that serialised every transaction commit that did
allocation/freeing. This serialised the entire transaction system,
and effectively globally limited XFS to around 20,000 commits/sec....

The original code was this:

/*
 * xfs_log_sbcount
 *
 * Called either periodically to keep the on disk superblock values
 * roughly up to date or from unmount to make sure the values are
 * correct on a clean unmount.
 *
 * Note this code can be called during the process of freezing, so
 * we may need to use the transaction allocator which does not not
 * block when the transaction subsystem is in its frozen state.
 */
int
xfs_log_sbcount(
       xfs_mount_t     *mp,
       uint            sync)
{
       xfs_trans_t     *tp;
       int             error;

       if (!xfs_fs_writable(mp))
               return 0;

       xfs_icsb_sync_counters(mp);

       /*
        * we don't need to do this if we are updating the superblock
        * counters on every modification.
        */
       if (!xfs_sb_version_haslazysbcount(&mp->m_sb))
               return 0;

       tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT);
       error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
                                       XFS_DEFAULT_LOG_COUNT);
       if (error) {
               xfs_trans_cancel(tp, 0);
               return error;
       }

       xfs_mod_sb(tp, XFS_SB_IFREE | XFS_SB_ICOUNT | XFS_SB_FDBLOCKS);
       if (sync)
               xfs_trans_set_sync(tp);
       xfs_trans_commit(tp, 0);

       return 0;
}

And that slowly ended up being modified and merged into
xfs_log_sb()/xfs_sync_sb() and xfs_log_sb() gained more callers that
didn't have the lazysbcount check. i.e. we lost the original reason
for doing this counter update, and so we end up using it
incorrectly.  Brian's log covering rework got rid of the explicit
xfs_log_sbcount() call in the unmount path and replaced it with this
"log covering is not needed sometimes when lazysbcounters are not in
use". So now there isn't any of the original "counter updates are
only necessary for lazysbcount filesystems" code left...

Going back further, we always used to write the superblock at
unmount, and this always used to happen after all the reservations
had been released. Hence it didn't matter whether lazy-sb was
enabled or not, the unmount always wrote out the correct value to
the superblock.

This explicit superblock write was removed from xfs_unmountfs() back
in 2012 by commit 211e4d434bd7 ("xfs: implement freezing by emptying
the AIL") where is was replaced in the unmount path by a call to
xfs_ail_push_all_sync(mp->m_ail). With an explicit call to
xfs_log_sbcount() in the unmountfs path, it was clear that this
would result in the superblock getting written if counters needed
syncing.

IOWs, these changes in 2012 meant that if the superblock is not
dirty, we don't actually write it at unmount. Which means that, for
a non-lazy sb counter filesyste, if the last thing to modify the
superblock was a call to xfs_log_sbcount(), there's every chance
that the superblock that is written back contains the in-memory
values in it from the percpu counters, not the value from
mp->m_sb.sb_fdblocks....

It's only taken ~10 years of testing to find an escape on a clean
unmount.... :/

So, yeah, this seems like a combination of changes over a period of
years adding up to incorrectly logging the superblock counts and not
noticing because unmount on lazycount filesystems rewrites them to
be correct on a clean unmount. BUt with lazysbcount == 0, we've
ended up with an escape because we don't actually update and write
the superblock at unmount anymore.

> > But I have no idea xfs_log_need_covered(mp) is always true at that time,
> > and the patchset seems a bit large and (possibly) hard to backport...
> 
> I wouldn't backport that to a stable series. :)

Nor is it necessary to fix the problem.

> > > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > > index 60e6d255e5e2..423dada3f64c 100644
> > > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > > @@ -928,7 +928,13 @@ xfs_log_sb(
> > > >  
> > > >  	mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
> > > >  	mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
> > > > -	mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> > > > +	if (!xfs_sb_version_haslazysbcount(&mp->m_sb)) {
> > > > +		struct xfs_dsb	*dsb = bp->b_addr;
> > > > +
> > > > +		mp->m_sb.sb_fdblocks = be64_to_cpu(dsb->sb_fdblocks);
> 
> Hmm... is this really needed?  I thought in !lazysbcount mode,
> xfs_trans_apply_sb_deltas updates the ondisk super buffer directly.
> So aren't all three of these updates unnecessary?

Yup, now I understand the issue, the fix is simply to avoid these
updates for !lazysb. i.e. it should just be:

	if (xfs_sb_version_haslazysbcount(&mp->m_sb)) {
		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
	}
	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx