Re: [PATCH 09/14] xfs: document online file metadata repair code

Allison Henderson <allison.henderson@xxxxxxxxxx> · Sat, 25 Feb 2023 07:32:58 +0000



On Thu, 2023-02-16 at 15:05 -0800, Darrick J. Wong wrote:
> On Thu, Feb 16, 2023 at 03:46:30PM +0000, Allison Henderson wrote:
> > On Sun, 2022-10-02 at 11:19 -0700, Darrick J. Wong wrote:
> 
> Not sure why this reply is to the earlier submission of the design
> doc,
> but oh well, it didn't change much between October and December of
> 2022.
oops, it just because I search the inbox for the next patch in the
series and inadvertently clicked the earlier one
> 
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Add to the fifth chapter of the online fsck design documentation,
> > > where
> > > we discuss the details of the data structures and algorithms used
> > > by
> > > the
> > > kernel to repair file metadata.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  150
> > > ++++++++++++++++++++
> > >  1 file changed, 150 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index c41f089549a0..10709dc74dcb 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -2872,3 +2872,153 @@ The allocation group free block list
> > > (AGFL)
> > > is repaired as follows:
> > >  4. Once the AGFL is full, reap any blocks leftover.
> > >  
> > >  5. The next operation to fix the freelist will right-size the
> > > list.
> > > +
> > > +Inode Record Repairs
> > > +--------------------
> > > +
> > > +Inode records must be handled carefully, because they have both
> > > ondisk records
> > > +("dinodes") and an in-memory ("cached") representation.
> > > +There is a very high potential for cache coherency issues if
> > > online
> > > fsck is not
> > > +careful to access the ondisk metadata *only* when the ondisk
> > > metadata is so
> > > +badly damaged that the filesystem cannot load the in-memory
> > > representation.
> > > +When online fsck wants to open a damaged file for scrubbing, it
> > > must
> > > use
> > > +specialized resource acquisition functions that return either
> > > the
> > > in-memory
> > > +representation *or* a lock on whichever object is necessary to
> > > prevent any
> > > +update to the ondisk location.
> > > +
> > > +The only repairs that should be made to the ondisk inode buffers
> > > are
> > > whatever
> > > +is necessary to get the in-core structure loaded.
> > > +This means fixing whatever is caught by the inode cluster buffer
> > > and
> > > inode fork
> > > +verifiers, and retrying the ``iget`` operation.
> > > +If the second ``iget`` fails, the repair has failed.
> > > +
> > > +Once the in-memory representation is loaded, repair can lock the
> > > inode and can
> > > +subject it to comprehensive checks, repairs, and optimizations.
> > > +Most inode attributes are easy to check and constrain, or are
> > > user-
> > > controlled
> > > +arbitrary bit patterns; these are both easy to fix.
> > > +Dealing with the data and attr fork extent counts and the file
> > > block
> > > counts is
> > > +more complicated, because computing the correct value requires
> > > traversing the
> > > +forks, or if that fails, leaving the fields invalid and waiting
> > > for
> > > the fork
> > > +fsck functions to run.
> > > +
> > > +The proposed patchset is the
> > > +`inode
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-inodes>`_
> > > +repair series.
> > > +
> > > +Quota Record Repairs
> > > +--------------------
> > > +
> > > +Similar to inodes, quota records ("dquots") also have both
> > > ondisk
> > > records and
> > > +an in-memory representation, and hence are subject to the same
> > > cache
> > > coherency
> > > +issues.
> > > +Somewhat confusingly, both are known as dquots in the XFS
> > > codebase.
> > > +
> > > +The only repairs that should be made to the ondisk quota record
> > > buffers are
> > > +whatever is necessary to get the in-core structure loaded.
> > > +Once the in-memory representation is loaded, the only attributes
> > > needing
> > > +checking are obviously bad limits and timer values.
> > > +
> > > +Quota usage counters are checked, repaired, and discussed
> > > separately
> > > in the
> > > +section about :ref:`live quotacheck <quotacheck>`.
> > > +
> > > +The proposed patchset is the
> > > +`quota
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-quota>`_
> > > +repair series.
> > > +
> > > +.. _fscounters:
> > > +
> > > +Freezing to Fix Summary Counters
> > > +--------------------------------
> > > +
> > > +Filesystem summary counters track availability of filesystem
> > > resources such
> > > +as free blocks, free inodes, and allocated inodes.
> > > +This information could be compiled by walking the free space and
> > > inode indexes,
> > > +but this is a slow process, so XFS maintains a copy in the
> > > ondisk
> > > superblock
> > > +that should reflect the ondisk metadata, at least when the
> > > filesystem has been
> > > +unmounted cleanly.
> > > +For performance reasons, XFS also maintains incore copies of
> > > those
> > > counters,
> > > +which are key to enabling resource reservations for active
> > > transactions.
> > > +Writer threads reserve the worst-case quantities of resources
> > > from
> > > the
> > > +incore counter and give back whatever they don't use at commit
> > > time.
> > > +It is therefore only necessary to serialize on the superblock
> > > when
> > > the
> > > +superblock is being committed to disk.
> > > +
> > > +The lazy superblock counter feature introduced in XFS v5 took
> > > this
> > > even further
> > > +by training log recovery to recompute the summary counters from
> > > the
> > > AG headers,
> > > +which eliminated the need for most transactions even to touch
> > > the
> > > superblock.
> > > +The only time XFS commits the summary counters is at filesystem
> > > unmount.
> > > +To reduce contention even further, the incore counter is
> > > implemented
> > > as a
> > > +percpu counter, which means that each CPU is allocated a batch
> > > of
> > > blocks from a
> > > +global incore counter and can satisfy small allocations from the
> > > local batch.
> > > +
> > > +The high-performance nature of the summary counters makes it
> > > difficult for
> > > +online fsck to check them, since there is no way to quiesce a
> > > percpu
> > > counter
> > > +while the system is running.
> > > +Although online fsck can read the filesystem metadata to compute
> > > the
> > > correct
> > > +values of the summary counters, there's no way to hold the value
> > > of
> > > a percpu
> > > +counter stable, so it's quite possible that the counter will be
> > > out
> > > of date by
> > > +the time the walk is complete.
> > > +Earlier versions of online scrub would return to userspace with
> > > an
> > > incomplete
> > > +scan flag, but this is not a satisfying outcome for a system
> > > administrator.
> > > +For repairs, the in-memory counters must be stabilize while
> > > walking
> > > the
> > nit: stablilized
> 
> Fixed, thank you.
> 
> > > +filesystem metadata to get an accurate reading and install it in
> > > the
> > > percpu
> > > +counter.
> > > +
> > > +To satisfy this requirement, online fsck must prevent other
> > > programs
> > > in the
> > > +system from initiating new writes to the filesystem, it must
> > > disable
> > > background
> > > +garbage collection threads, and it must wait for existing writer
> > > programs to
> > > +exit the kernel.
> > > +Once that has been established, scrub can walk the AG free space
> > > indexes, the
> > > +inode btrees, and the realtime bitmap to compute the correct
> > > value
> > > of all
> > > +four summary counters.
> > > +This is very similar to a filesystem freeze.
> > > +
> > > +The initial implementation used the actual VFS filesystem freeze
> > > mechanism to
> > > +quiesce filesystem activity.
> > 
> > 
> > > +With the filesystem frozen, it is possible to resolve the
> > > counter
> > > values with
> > > +exact precision, but there are many problems with calling the
> > > VFS
> > > methods
> > > +directly:
> > > +
> > > +- Other programs can unfreeze the filesystem without our
> > > knowledge.
> > > +  This leads to incorrect scan results and incorrect repairs.
> > > +
> > > +- Adding an extra lock to prevent others from thawing the
> > > filesystem
> > > required
> > > +  the addition of a ``->freeze_super`` function to wrap
> > > ``freeze_fs()``.
> > > +  This in turn caused other subtle problems because it turns out
> > > that the VFS
> > > +  ``freeze_super`` and ``thaw_super`` functions can drop the
> > > last
> > > reference to
> > > +  the VFS superblock, and any subsequent access becomes a UAF
> > > bug!
> > > +  This can happen if the filesystem is unmounted while the
> > > underlying block
> > > +  device has frozen the filesystem.
> > > +  This problem could be solved by grabbing extra references to
> > > the
> > > superblock,
> > > +  but it felt suboptimal given the other inadequacies of this
> > > approach:
> > > +
> > > +- The log need not be quiesced to check the summary counters,
> > > but a
> > > VFS freeze
> > > +  initiates one anyway.
> > > +  This adds unnecessary runtime to live fscounter fsck
> > > operations.
> > > +
> > > +- Quiescing the log means that XFS flushes the (possibly
> > > incorrect)
> > > counters to
> > > +  disk as part of cleaning the log.
> > > +
> > > +- A bug in the VFS meant that freeze could complete even when
> > > sync_filesystem
> > > +  fails to flush the filesystem and returns an error.
> > > +  This bug was fixed in Linux 5.17.
> > > +
> > > +The author 
> > 
> > I think the above is a candidate for culling or maybe the side bar
> > if
> > you really feel the need to keep it.
> 
> I'll turn it into a sidebar, since it took me a while to figure all
> that
> out, and maybe some day I'll need to refer to what amounts to
> historic
> notes.
> 
> > Aside from the fact that it was
> > simply bug prone, it makes sense to avoid an entire freeze if we
> > can
> > help it.  I think it would be fine to just pop into the current
> > proposition:
> > 
> > "However, this approach was found to be both costly and unstable. 
> > Later optimizations of ofsck established ..."
> 
> How about:
> 
> "...Once that has been established, scrub can walk the AG free space
> indexes, the inode btrees, and the realtime bitmap to compute the
> correct value of all four summary counters.  This is very similar to
> a
> filesystem freeze, though not all of the pieces are necessary:
> 
> - "The final freeze state is set one higher than
> ``SB_FREEZE_COMPLETE``
>   to prevent other threads from thawing the filesystem, or other
> scrub
>   threads from initiating another fscounters freeze.
> 
> - IIt does not quiesce the log.
> 
> "With this code in place, it is now possible to pause the filesystem
> for
> just long enough to check and correct the summary counters."
> 
> <giant historic sidebar here>
Ok, I think that's a fair compromise
> 
> ?
> 
> > > established that the only component of online fsck that requires
> > > the
> > > +ability to freeze the filesystem is the fscounter scrubber, so
> > > the
> > > code for
> > > +this could be localized to that source file.
> > > +fscounter freeze behaves the same as the VFS freeze method,
> > > except:
> > > +
> > > +- The final freeze state is set one higher than
> > > ``SB_FREEZE_COMPLETE`` to
> > > +  prevent other threads from thawing the filesystem.
> > > +
> > > +- It does not quiesce the log.
> > > +
> > > +With this code in place, it is now possible to pause the
> > > filesystem
> > > for just
> > > +long enough to check and correct the summary counters.
> > > +
> > > +The proposed patchset is the
> > > +`summary counter cleanup
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-fscounters>`_
> > > +series.
> > > 
> > 
> > Otherwise looks good.  :-)
> 
> Thanks!
> 
> --D
> 
> > Allison
> >