On Thu, 2023-02-16 at 15:05 -0800, Darrick J. Wong wrote: > On Thu, Feb 16, 2023 at 03:46:30PM +0000, Allison Henderson wrote: > > On Sun, 2022-10-02 at 11:19 -0700, Darrick J. Wong wrote: > > Not sure why this reply is to the earlier submission of the design > doc, > but oh well, it didn't change much between October and December of > 2022. oops, it just because I search the inbox for the next patch in the series and inadvertently clicked the earlier one > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > Add to the fifth chapter of the online fsck design documentation, > > > where > > > we discuss the details of the data structures and algorithms used > > > by > > > the > > > kernel to repair file metadata. > > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > --- > > > .../filesystems/xfs-online-fsck-design.rst | 150 > > > ++++++++++++++++++++ > > > 1 file changed, 150 insertions(+) > > > > > > > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst > > > b/Documentation/filesystems/xfs-online-fsck-design.rst > > > index c41f089549a0..10709dc74dcb 100644 > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst > > > @@ -2872,3 +2872,153 @@ The allocation group free block list > > > (AGFL) > > > is repaired as follows: > > > 4. Once the AGFL is full, reap any blocks leftover. > > > > > > 5. The next operation to fix the freelist will right-size the > > > list. > > > + > > > +Inode Record Repairs > > > +-------------------- > > > + > > > +Inode records must be handled carefully, because they have both > > > ondisk records > > > +("dinodes") and an in-memory ("cached") representation. > > > +There is a very high potential for cache coherency issues if > > > online > > > fsck is not > > > +careful to access the ondisk metadata *only* when the ondisk > > > metadata is so > > > +badly damaged that the filesystem cannot load the in-memory > > > representation. > > > +When online fsck wants to open a damaged file for scrubbing, it > > > must > > > use > > > +specialized resource acquisition functions that return either > > > the > > > in-memory > > > +representation *or* a lock on whichever object is necessary to > > > prevent any > > > +update to the ondisk location. > > > + > > > +The only repairs that should be made to the ondisk inode buffers > > > are > > > whatever > > > +is necessary to get the in-core structure loaded. > > > +This means fixing whatever is caught by the inode cluster buffer > > > and > > > inode fork > > > +verifiers, and retrying the ``iget`` operation. > > > +If the second ``iget`` fails, the repair has failed. > > > + > > > +Once the in-memory representation is loaded, repair can lock the > > > inode and can > > > +subject it to comprehensive checks, repairs, and optimizations. > > > +Most inode attributes are easy to check and constrain, or are > > > user- > > > controlled > > > +arbitrary bit patterns; these are both easy to fix. > > > +Dealing with the data and attr fork extent counts and the file > > > block > > > counts is > > > +more complicated, because computing the correct value requires > > > traversing the > > > +forks, or if that fails, leaving the fields invalid and waiting > > > for > > > the fork > > > +fsck functions to run. > > > + > > > +The proposed patchset is the > > > +`inode > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=repair-inodes>`_ > > > +repair series. > > > + > > > +Quota Record Repairs > > > +-------------------- > > > + > > > +Similar to inodes, quota records ("dquots") also have both > > > ondisk > > > records and > > > +an in-memory representation, and hence are subject to the same > > > cache > > > coherency > > > +issues. > > > +Somewhat confusingly, both are known as dquots in the XFS > > > codebase. > > > + > > > +The only repairs that should be made to the ondisk quota record > > > buffers are > > > +whatever is necessary to get the in-core structure loaded. > > > +Once the in-memory representation is loaded, the only attributes > > > needing > > > +checking are obviously bad limits and timer values. > > > + > > > +Quota usage counters are checked, repaired, and discussed > > > separately > > > in the > > > +section about :ref:`live quotacheck <quotacheck>`. > > > + > > > +The proposed patchset is the > > > +`quota > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=repair-quota>`_ > > > +repair series. > > > + > > > +.. _fscounters: > > > + > > > +Freezing to Fix Summary Counters > > > +-------------------------------- > > > + > > > +Filesystem summary counters track availability of filesystem > > > resources such > > > +as free blocks, free inodes, and allocated inodes. > > > +This information could be compiled by walking the free space and > > > inode indexes, > > > +but this is a slow process, so XFS maintains a copy in the > > > ondisk > > > superblock > > > +that should reflect the ondisk metadata, at least when the > > > filesystem has been > > > +unmounted cleanly. > > > +For performance reasons, XFS also maintains incore copies of > > > those > > > counters, > > > +which are key to enabling resource reservations for active > > > transactions. > > > +Writer threads reserve the worst-case quantities of resources > > > from > > > the > > > +incore counter and give back whatever they don't use at commit > > > time. > > > +It is therefore only necessary to serialize on the superblock > > > when > > > the > > > +superblock is being committed to disk. > > > + > > > +The lazy superblock counter feature introduced in XFS v5 took > > > this > > > even further > > > +by training log recovery to recompute the summary counters from > > > the > > > AG headers, > > > +which eliminated the need for most transactions even to touch > > > the > > > superblock. > > > +The only time XFS commits the summary counters is at filesystem > > > unmount. > > > +To reduce contention even further, the incore counter is > > > implemented > > > as a > > > +percpu counter, which means that each CPU is allocated a batch > > > of > > > blocks from a > > > +global incore counter and can satisfy small allocations from the > > > local batch. > > > + > > > +The high-performance nature of the summary counters makes it > > > difficult for > > > +online fsck to check them, since there is no way to quiesce a > > > percpu > > > counter > > > +while the system is running. > > > +Although online fsck can read the filesystem metadata to compute > > > the > > > correct > > > +values of the summary counters, there's no way to hold the value > > > of > > > a percpu > > > +counter stable, so it's quite possible that the counter will be > > > out > > > of date by > > > +the time the walk is complete. > > > +Earlier versions of online scrub would return to userspace with > > > an > > > incomplete > > > +scan flag, but this is not a satisfying outcome for a system > > > administrator. > > > +For repairs, the in-memory counters must be stabilize while > > > walking > > > the > > nit: stablilized > > Fixed, thank you. > > > > +filesystem metadata to get an accurate reading and install it in > > > the > > > percpu > > > +counter. > > > + > > > +To satisfy this requirement, online fsck must prevent other > > > programs > > > in the > > > +system from initiating new writes to the filesystem, it must > > > disable > > > background > > > +garbage collection threads, and it must wait for existing writer > > > programs to > > > +exit the kernel. > > > +Once that has been established, scrub can walk the AG free space > > > indexes, the > > > +inode btrees, and the realtime bitmap to compute the correct > > > value > > > of all > > > +four summary counters. > > > +This is very similar to a filesystem freeze. > > > + > > > +The initial implementation used the actual VFS filesystem freeze > > > mechanism to > > > +quiesce filesystem activity. > > > > > > > +With the filesystem frozen, it is possible to resolve the > > > counter > > > values with > > > +exact precision, but there are many problems with calling the > > > VFS > > > methods > > > +directly: > > > + > > > +- Other programs can unfreeze the filesystem without our > > > knowledge. > > > + This leads to incorrect scan results and incorrect repairs. > > > + > > > +- Adding an extra lock to prevent others from thawing the > > > filesystem > > > required > > > + the addition of a ``->freeze_super`` function to wrap > > > ``freeze_fs()``. > > > + This in turn caused other subtle problems because it turns out > > > that the VFS > > > + ``freeze_super`` and ``thaw_super`` functions can drop the > > > last > > > reference to > > > + the VFS superblock, and any subsequent access becomes a UAF > > > bug! > > > + This can happen if the filesystem is unmounted while the > > > underlying block > > > + device has frozen the filesystem. > > > + This problem could be solved by grabbing extra references to > > > the > > > superblock, > > > + but it felt suboptimal given the other inadequacies of this > > > approach: > > > + > > > +- The log need not be quiesced to check the summary counters, > > > but a > > > VFS freeze > > > + initiates one anyway. > > > + This adds unnecessary runtime to live fscounter fsck > > > operations. > > > + > > > +- Quiescing the log means that XFS flushes the (possibly > > > incorrect) > > > counters to > > > + disk as part of cleaning the log. > > > + > > > +- A bug in the VFS meant that freeze could complete even when > > > sync_filesystem > > > + fails to flush the filesystem and returns an error. > > > + This bug was fixed in Linux 5.17. > > > + > > > +The author > > > > I think the above is a candidate for culling or maybe the side bar > > if > > you really feel the need to keep it. > > I'll turn it into a sidebar, since it took me a while to figure all > that > out, and maybe some day I'll need to refer to what amounts to > historic > notes. > > > Aside from the fact that it was > > simply bug prone, it makes sense to avoid an entire freeze if we > > can > > help it. I think it would be fine to just pop into the current > > proposition: > > > > "However, this approach was found to be both costly and unstable. > > Later optimizations of ofsck established ..." > > How about: > > "...Once that has been established, scrub can walk the AG free space > indexes, the inode btrees, and the realtime bitmap to compute the > correct value of all four summary counters. This is very similar to > a > filesystem freeze, though not all of the pieces are necessary: > > - "The final freeze state is set one higher than > ``SB_FREEZE_COMPLETE`` > to prevent other threads from thawing the filesystem, or other > scrub > threads from initiating another fscounters freeze. > > - IIt does not quiesce the log. > > "With this code in place, it is now possible to pause the filesystem > for > just long enough to check and correct the summary counters." > > <giant historic sidebar here> Ok, I think that's a fair compromise > > ? > > > > established that the only component of online fsck that requires > > > the > > > +ability to freeze the filesystem is the fscounter scrubber, so > > > the > > > code for > > > +this could be localized to that source file. > > > +fscounter freeze behaves the same as the VFS freeze method, > > > except: > > > + > > > +- The final freeze state is set one higher than > > > ``SB_FREEZE_COMPLETE`` to > > > + prevent other threads from thawing the filesystem. > > > + > > > +- It does not quiesce the log. > > > + > > > +With this code in place, it is now possible to pause the > > > filesystem > > > for just > > > +long enough to check and correct the summary counters. > > > + > > > +The proposed patchset is the > > > +`summary counter cleanup > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=repair-fscounters>`_ > > > +series. > > > > > > > Otherwise looks good. :-) > > Thanks! > > --D > > > Allison > >