On Thu, 2023-02-16 at 14:48 -0800, Darrick J. Wong wrote:
> On Thu, Feb 16, 2023 at 03:47:20PM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Certain parts of the online fsck code need to scan every file in the
> > > entire filesystem.  It is not acceptable to block the entire filesystem
> > > while this happens, which means that we need to be clever in allowing
> > > scans to coordinate with ongoing filesystem updates.  We also need to
> > > hook the filesystem so that regular updates propagate to the staging
> > > records.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst |  677 ++++++++++++++++++++
> > >  1 file changed, 677 insertions(+)
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index a658da8fe4ae..c0f08a773f08 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -3018,3 +3018,680 @@ The proposed patchset is the
> > >  `summary counter cleanup
> > >  <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
> > >  series.
> > > +
> > > +Full Filesystem Scans
> > > +---------------------
> > > +
> > > +Certain types of metadata can only be checked by walking every file in the
> > > +entire filesystem to record observations and comparing the observations
> > > +against what's recorded on disk.
> > > +Like every other type of online repair, repairs are made by writing those
> > > +observations to disk in a replacement structure and committing it atomically.
> > > +However, it is not practical to shut down the entire filesystem to examine
> > > +hundreds of billions of files because the downtime would be excessive.
> > > +Therefore, online fsck must build the infrastructure to manage a live scan
> > > +of all the files in the filesystem.
> > > +There are two questions that need to be solved to perform a live walk:
> > > +
> > > +- How does scrub manage the scan while it is collecting data?
> > > +
> > > +- How does the scan keep abreast of changes being made to the system by
> > > +  other threads?
> > > +
> > > +.. _iscan:
> > > +
> > > +Coordinated Inode Scans
> > > +```````````````````````
> > > +
> > > +In the original Unix filesystems of the 1970s, each directory entry
> > > +contained an index number (*inumber*) which was used as an index into an
> > > +ondisk array (*itable*) of fixed-size records (*inodes*) describing a
> > > +file's attributes and its data block mapping.
> > > +This system is described by J. Lions, `"inode (5659)"
> > > +<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
> > > +UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
> > > +Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
> > > +`"Implementation of the File System"
> > > +<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The
> > > +UNIX Time-Sharing System*, (The Bell System Technical Journal, July 1978),
> > > +pp. 1913-4.
> > > +
> > > +XFS retains most of this design, except now inumbers are search keys over
> > > +all the space in the data section filesystem.
> > > +They form a continuous keyspace that can be expressed as a 64-bit integer,
> > > +though the inodes themselves are sparsely distributed within the keyspace.
> > > +Scans proceed in a linear fashion across the inumber keyspace, starting
> > > +from ``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
> > > +Naturally, a scan through a keyspace requires a scan cursor object to track
> > > +the scan progress.
> > > +Because this keyspace is sparse, this cursor contains two parts.
> > > +The first part of this scan cursor object tracks the inode that will be
> > > +examined next; call this the examination cursor.
> > > +Somewhat less obviously, the scan cursor object must also track which parts
> > > +of the keyspace have already been visited, which is critical for deciding
> > > +if a concurrent filesystem update needs to be incorporated into the scan
> > > +data.
> > > +Call this the visited inode cursor.
> > > +
> > > +Advancing the scan cursor is a multi-step process encapsulated in
> > > +``xchk_iscan_iter``:
> > > +
> > > +1. Lock the AGI buffer of the AG containing the inode pointed to by the
> > > +   visited inode cursor.
> > > +   This guarantees that inodes in this AG cannot be allocated or freed
> > > +   while advancing the cursor.
> > > +
> > > +2. Use the per-AG inode btree to look up the next inumber after the one
> > > +   that was just visited, since it may not be keyspace adjacent.
> > > +
> > > +3. If there are no more inodes left in this AG:
> > > +
> > > +   a. Move the examination cursor to the point of the inumber keyspace that
> > > +      corresponds to the start of the next AG.
> > > +
> > > +   b. Adjust the visited inode cursor to indicate that it has "visited" the
> > > +      last possible inode in the current AG's inode keyspace.
> > > +      XFS inumbers are segmented, so the cursor needs to be marked as
> > > +      having visited the entire keyspace up to just before the start of the
> > > +      next AG's inode keyspace.
> > > +
> > > +   c. Unlock the AGI and return to step 1 if there are unexamined AGs in
> > > +      the filesystem.
> > > +
> > > +   d. If there are no more AGs to examine, set both cursors to the end of
> > > +      the inumber keyspace.
> > > +      The scan is now complete.
> > > +
> > > +4. Otherwise, there is at least one more inode to scan in this AG:
> > > +
> > > +   a. Move the examination cursor ahead to the next inode marked as
> > > +      allocated by the inode btree.
> > > +
> > > +   b. Adjust the visited inode cursor to point to the inode just prior to
> > > +      where the examination cursor is now.
> > > +      Because the scanner holds the AGI buffer lock, no inodes could have
> > > +      been created in the part of the inode keyspace that the visited inode
> > > +      cursor just advanced.
> > > +
> > > +5. Get the incore inode for the inumber of the examination cursor.
> > > +   By maintaining the AGI buffer lock until this point, the scanner knows
> > > +   that it was safe to advance the examination cursor across the entire
> > > +   keyspace, and that it has stabilized this next inode so that it cannot
> > > +   disappear from the filesystem until the scan releases the incore inode.
> > > +
> > > +6. Drop the AGI lock and return the incore inode to the caller.
> > > +
> > > +Online fsck functions scan all files in the filesystem as follows:
> > > +
> > > +1. Start a scan by calling ``xchk_iscan_start``.
> > 
> > Hmm, I actually did not find xchk_iscan_start in the below branch, I
> > found xchk_iscan_iter in "xfs: implement live inode scan for scrub",
> > but it doesn't look like anything uses it yet, at least not in that
> > branch.
> 
> <nod> The topic branch linked below has the implementation, but no
> users.  The first user is online quotacheck, which is in the next branch
> after that:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck
> 
> Specifically, this patch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-quotacheck&id=3640515b9282514d91a407b6aa8d8b73caa123c5
> 
> I'll restate what you probably saw in the commit message for this
> email discussion:
> 
> This "one branch to introduce a new infrastructure and a second branch
> to actually use it" pattern is a result of reviewer requests for smaller
> more focused branches.  This has turned out to be useful in practice
> because it's easier to move just these pieces up and down in the branch
> as needed.  The inode scan was originally developed for rmapbt repair
> (which comes *much* later) and I moved it up once I realized that
> quotacheck has far fewer dependencies and hence all of this could come
> earlier.
> 
> You're right that this section ought to point to an actual user of the
> functionality.  Will fix. :)

Alrighty then, sounds good

> > Also, it took me a bit to figure out that "initial user" meant
> > "calling function"
> 
> Er... are you talking about the sentence "...new code is split out as a
> separate patch from its initial user" in the patch commit message?
> 
> Maybe I should reword that:
> 
> "This new code is a separate patch from the patches adding callers for
> the sake of enabling the author to move patches around his tree..."

Yes, I think that's clearer :-)

> > > +
> > > +2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
> > > +   If one is provided:
> > > +
> > > +   a. Lock the inode to prevent updates during the scan.
> > > +
> > > +   b. Scan the inode.
> > > +
> > > +   c. While still holding the inode lock, adjust the visited inode cursor
> > > +      (``xchk_iscan_mark_visited``) to point to this inode.
> > > +
> > > +   d. Unlock and release the inode.
> > > +
> > > +3. Call ``xchk_iscan_finish`` to complete the scan.
> > > +
> > > +There are subtleties with the inode cache that complicate grabbing the
> > > +incore inode for the caller.
> > > +Obviously, it is an absolute requirement that the inode metadata be
> > > +consistent enough to load it into the inode cache.
> > > +Second, if the incore inode is stuck in some intermediate state, the scan
> > > +coordinator must release the AGI and push the main filesystem to get the
> > > +inode back into a loadable state.
> > > +
> > > +The proposed patches are the
> > > +`inode scanner
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
> > > +series.
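
Just to check my understanding of the API shape, a caller's scan loop
would look roughly like the sketch below?  Simplified signatures, an
invented xchk_example_scan_inode() helper, and I'm guessing that
xchk_iscan_iter() returns 1 when it hands back an inode and 0 when the
keyspace is exhausted, based on the description above:

	static int xchk_example_filescan(struct xfs_scrub *sc)
	{
		struct xchk_iscan	iscan;
		struct xfs_inode	*ip;
		int			error;

		/* Both cursors start at inumber 0x0. */
		xchk_iscan_start(sc, &iscan);

		/* Returns 1 and an inode until the keyspace is exhausted. */
		while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
			/* Hold the inode lock to prevent updates mid-scan. */
			xfs_ilock(ip, XFS_ILOCK_SHARED);

			/* Record observations; invented helper. */
			error = xchk_example_scan_inode(sc, ip);

			/*
			 * Mark the inode visited while still locked so that
			 * live updates to it get routed to the scan data.
			 */
			xchk_iscan_mark_visited(&iscan, ip);

			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			xfs_irele(ip);
			if (error)
				break;
		}

		xchk_iscan_finish(&iscan);
		return error;
	}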
> > > +
> > > +Inode Management
> > > +````````````````
> > > +
> > > +In regular filesystem code, references to allocated XFS incore inodes are
> > > +always obtained (``xfs_iget``) outside of transaction context because the
> > > +creation of the incore context for ane xisting file does not require
> > 
> > an existing
> 
> Corrected, thank you.
> 
> > > +metadata updates.
> > > +However, it is important to note that references to incore inodes obtained
> > > +as part of file creation must be performed in transaction context because
> > > +the filesystem must ensure the atomicity of the ondisk inode btree index
> > > +updates and the initialization of the actual ondisk inode.
> > > +
> > > +References to incore inodes are always released (``xfs_irele``) outside of
> > > +transaction context because there are a handful of activities that might
> > > +require ondisk updates:
> > > +
> > > +- The VFS may decide to kick off writeback as part of a ``DONTCACHE``
> > > +  inode release.
> > > +
> > > +- Speculative preallocations need to be unreserved.
> > > +
> > > +- An unlinked file may have lost its last reference, in which case the
> > > +  entire file must be inactivated, which involves releasing all of its
> > > +  resources in the ondisk metadata and freeing the inode.
> > > +
> > > +These activities are collectively called inode inactivation.
> > > +Inactivation has two parts -- the VFS part, which initiates writeback on
> > > +all dirty file pages, and the XFS part, which cleans up XFS-specific
> > > +information and frees the inode if it was unlinked.
> > > +If the inode is unlinked (or unconnected after a file handle operation),
> > > +the kernel drops the inode into the inactivation machinery immediately.
> > > +
> > > +During normal operation, resource acquisition for an update follows this
> > > +order to avoid deadlocks:
> > > +
> > > +1. Inode reference (``iget``).
> > > +
> > > +2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
> > > +
> > > +3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
> > > +
> > > +4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations
> > > +   that can update page cache mappings.
> > > +
> > > +5. Log feature enablement.
> > > +
> > > +6. Transaction log space grant.
> > > +
> > > +7. Space on the data and realtime devices for the transaction.
> > > +
> > > +8. Incore dquot references, if a file is being repaired.
> > > +   Note that they are not locked, merely acquired.
> > > +
> > > +9. Inode ``ILOCK`` for file metadata updates.
> > > +
> > > +10. AG header buffer locks / Realtime metadata inode ILOCK.
> > > +
> > > +11. Realtime metadata buffer locks, if applicable.
> > > +
> > > +12. Extent mapping btree blocks, if applicable.
> > > +
> > > +Resources are often released in the reverse order, though this is not
> > > +required.
> > > +However, online fsck differs from regular XFS operations because it may
> > > +examine an object that normally is acquired in a later stage of the locking
> > > +order, and then decide to cross-reference the object with an object that is
> > > +acquired earlier in the order.
> > > +The next few sections detail the specific ways in which online fsck takes
> > > +care to avoid deadlocks.
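
For my own notes, the acquisition order reads more easily as code.  A
condensed sketch, with made-up names where the text doesn't give them
(example_order, resblks), tr_itruncate picked arbitrarily as the
reservation, and most error handling elided:

	static int example_order(struct xfs_mount *mp, xfs_ino_t ino,
			unsigned int resblks)
	{
		struct xfs_inode	*ip;
		struct xfs_trans	*tp;
		int			error;

		error = xfs_iget(mp, NULL, ino, 0, 0, &ip);	/* 1: inode ref */
		if (error)
			return error;
		/* 2: mnt_want_write_file() if this is a repair */
		xfs_ilock(ip, XFS_IOLOCK_EXCL);			/* 3: IOLOCK */
		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);		/* 4: MMAPLOCK */
		/* 5: log feature enablement would go here */
		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
				resblks, 0, 0, &tp);		/* 6-7 */
		/* 8: attach incore dquot references, if repairing */
		xfs_ilock(ip, XFS_ILOCK_EXCL);			/* 9: ILOCK */
		xfs_trans_ijoin(tp, ip, 0);
		/* 10-12: AG headers, rt metadata, bmbt blocks as needed */
		return error;
	}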
> > > +
> > > +iget and irele During a Scrub
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +An inode scan performed on behalf of a scrub operation runs in transaction
> > > +context, and possibly with resources already locked and bound to it.
> > > +This isn't much of a problem for ``iget`` since it can operate in the
> > > +context of an existing transaction, as long as all of the bound resources
> > > +are acquired before the inode reference in the regular filesystem.
> > > +
> > > +When the VFS ``iput`` function is given a linked inode with no other
> > > +references, it normally puts the inode on an LRU list in the hope that it
> > > +can save time if another process re-opens the file before the system runs
> > > +out of memory and frees it.
> > > +Filesystem callers can short-circuit the LRU process by setting a
> > > +``DONTCACHE`` flag on the inode to cause the kernel to try to drop the
> > > +inode into the inactivation machinery immediately.
> > > +
> > > +In the past, inactivation was always done from the process that dropped
> > > +the inode, which was a problem for scrub because scrub may already hold a
> > > +transaction, and XFS does not support nesting transactions.
> > > +On the other hand, if there is no scrub transaction, it is desirable to
> > > +drop otherwise unused inodes immediately to avoid polluting caches.
> > > +To capture these nuances, the online fsck code has a separate
> > > +``xchk_irele`` function to set or clear the ``DONTCACHE`` flag to get the
> > > +required release behavior.
> > > +
> > > +Proposed patchsets include fixing
> > > +`scrub iget usage
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
> > > +`dir iget usage
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
> > > +
> > > +Locking Inodes
> > > +^^^^^^^^^^^^^^
> > > +
> > > +In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK
> > > +locks in a well-known order: parent → child when updating the directory
> > > +tree, and ``struct inode`` address order otherwise.
> > > +For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop
> > > +page faults.
> > > +If two MMAPLOCKs must be acquired, they are acquired in ``struct
> > > +address_space`` order.
> > 
> > the order of their memory address
> > 
> > ?
> 
> Urghg.  I think I need to clarify this more:
> 
> "...they are acquired in numerical order of the addresses of their
> ``struct address_space`` objects."
> 
> See filemap_invalidate_lock_two.

Yep, I think that works

> > > +Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs
> > > +must be acquired before transactions are allocated.
> > > +If two ILOCKs must be acquired, they are acquired in inumber order.
> > > +
> > > +Inode lock acquisition must be done carefully during a coordinated inode
> > > +scan.
> > > +Online fsck cannot abide these conventions, because for a directory tree
> > > +scanner, the scrub process holds the IOLOCK of the file being scanned and
> > > +it needs to take the IOLOCK of the file at the other end of the directory
> > > +link.
> > > +If the directory tree is corrupt because it contains a cycle,
> > > +``xfs_scrub`` cannot use the regular inode locking functions and avoid
> > > +becoming trapped in an ABBA deadlock.
> > > +
> > > +Solving both of these problems is straightforward -- any time online fsck
> > > +needs to take a second lock of the same class, it uses trylock to avoid an
> > > +ABBA deadlock.
> > > +If the trylock fails, scrub drops all inode locks and uses trylock loops
> > > +to (re)acquire all necessary resources.
> > > +Trylock loops enable scrub to check for pending fatal signals, which is
> > > +how scrub avoids deadlocking the filesystem or becoming an unresponsive
> > > +process.
> > > +However, trylock loops mean that online fsck must be prepared to measure
> > > +the resource being scrubbed before and after the lock cycle to detect
> > > +changes and react accordingly.
> > > +
> > > +.. _dirparent:
> > > +
> > > +Case Study: Finding a Directory Parent
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Consider the directory parent pointer repair code as an example.
> > > +Online fsck must verify that the dotdot dirent of a directory points up to
> > > +a parent directory, and that the parent directory contains exactly one
> > > +dirent pointing down to the child directory.
> > > +Fully validating this relationship (and repairing it if possible) requires
> > > +a walk of every directory on the filesystem while holding the child locked,
> > > +and while updates to the directory tree are being made.
> > > +The coordinated inode scan provides a way to walk the filesystem without
> > > +the possibility of missing an inode.
> > > +The child directory is kept locked to prevent updates to the dotdot
> > > +dirent, but if the scanner fails to lock a parent, it can drop and relock
> > > +both the child and the prospective parent.
> > > +If the dotdot entry changes while the directory is unlocked, then a move
> > > +or rename operation must have changed the child's parentage, and the scan
> > > +can exit early.
> > > +
> > > +The proposed patchset is the
> > > +`directory repair
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
> > > +series.
> > > +
> > > +.. _fshooks:
> > > +
> > > +Filesystem Hooks
> > > +`````````````````
> > > +
> > > +The second piece of support that online fsck functions need during a full
> > > +filesystem scan is the ability to stay informed about updates being made
> > > +by other threads in the filesystem, since comparisons against the past are
> > > +useless in a dynamic environment.
> > > +Two pieces of Linux kernel infrastructure enable online fsck to monitor
> > > +regular filesystem operations: filesystem hooks and :ref:`static
> > > +keys<jump_labels>`.
> > > +
> > > +Filesystem hooks convey information about an ongoing filesystem operation
> > > +to a downstream consumer.
> > > +In this case, the downstream consumer is always an online fsck function.
> > > +Because multiple fsck functions can run in parallel, online fsck uses the
> > > +Linux notifier call chain facility to dispatch updates to any number of
> > > +interested fsck processes.
> > > +Call chains are a dynamic list, which means that they can be configured
> > > +at run time.
> > > +Because these hooks are private to the XFS module, the information passed
> > > +along contains exactly what the checking function needs to update its
> > > +observations.
> > > +
> > > +The current implementation of XFS hooks uses SRCU notifier chains to
> > > +reduce the impact on highly threaded workloads.
> > > +Regular blocking notifier chains use a rwsem and seem to have a much lower
> > > +overhead for single-threaded applications.
> > > +However, it may turn out that blocking chains plus static keys are a more
> > > +performant combination; more study is needed here.
> > > +
> > > +The following pieces are necessary to hook a certain point in the
> > > +filesystem:
> > > +
> > > +- A ``struct xfs_hooks`` object must be embedded in a convenient place
> > > +  such as a well-known incore filesystem object.
> > > +
> > > +- Each hook must define an action code and a structure containing more
> > > +  context about the action.
> > > +
> > > +- Hook providers should provide appropriate wrapper functions and structs
> > > +  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of
> > > +  type checking to ensure correct usage.
> > > +
> > > +- A callsite in the regular filesystem code must be chosen to call
> > > +  ``xfs_hooks_call`` with the action code and data structure.
> > > +  This place should be adjacent to (and not earlier than) the place where
> > > +  the filesystem update is committed to the transaction.
> > > +  In general, when the filesystem calls a hook chain, it should be able to
> > > +  handle sleeping and should not be vulnerable to memory reclaim or locking
> > > +  recursion.
> > > +  However, the exact requirements are very dependent on the context of the
> > > +  hook caller and the callee.
> > > +
> > > +- The online fsck function should define a structure to hold scan data, a
> > > +  lock to coordinate access to the scan data, and a ``struct xfs_hook``
> > > +  object.
> > > +  The scanner function and the regular filesystem code must acquire
> > > +  resources in the same order; see the next section for details.
> > > +
> > > +- The online fsck code must contain a C function to catch the hook action
> > > +  code and data structure.
> > > +  If the object being updated has already been visited by the scan, then
> > > +  the hook information must be applied to the scan data.
> > > +
> > > +- Prior to unlocking inodes to start the scan, online fsck must call
> > > +  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
> > > +  ``xfs_hooks_add`` to enable the hook.
> > > +
> > > +- Online fsck must call ``xfs_hooks_del`` to disable the hook once the
> > > +  scan is complete.
> > > +
> > > +The number of hooks should be kept to a minimum to reduce complexity.
> > > +Static keys are used to reduce the overhead of filesystem hooks to nearly
> > > +zero when online fsck is not running.
> > > +
> > > +.. _liveupdate:
> > > +
> > > +Live Updates During a Scan
> > > +``````````````````````````
> > > +
> > > +The code paths of the online fsck scanning code and the
> > > +:ref:`hooked<fshooks>` filesystem code look like this::
> > > +
> > > +    other program
> > > +          ↓
> > > +    inode lock ←────────────────────┐
> > > +          ↓                         │
> > > +    AG header lock                  │
> > > +          ↓                         │
> > > +    filesystem function             │
> > > +          ↓                         │
> > > +    notifier call chain             │    same
> > > +          ↓                         ├─── inode
> > > +    scrub hook function             │    lock
> > > +          ↓                         │
> > > +    scan data mutex ←──┐    same    │
> > > +          ↓            ├─── scan    │
> > > +    update scan data   │    lock    │
> > > +          ↑            │            │
> > > +    scan data mutex ←──┘            │
> > > +          ↑                         │
> > > +    inode lock ←────────────────────┘
> > > +          ↑
> > > +    scrub function
> > > +          ↑
> > > +    inode scanner
> > > +          ↑
> > > +    xfs_scrub
> > > +
> > > +These rules must be followed to ensure correct interactions between the
> > > +checking code and the code making an update to the filesystem:
> > > +
> > > +- Prior to invoking the notifier call chain, the filesystem function being
> > > +  hooked must acquire the same lock that the scrub scanning function
> > > +  acquires to scan the inode.
> > > +
> > > +- The scanning function and the scrub hook function must coordinate access
> > > +  to the scan data by acquiring a lock on the scan data.
> > > +
> > > +- Scrub hook functions must not add the live update information to the
> > > +  scan observations unless the inode being updated has already been
> > > +  scanned.
> > > +  The scan coordinator has a helper predicate
> > > +  (``xchk_iscan_want_live_update``) for this.
> > > +
> > > +- Scrub hook functions must not change the caller's state, including the
> > > +  transaction that it is running.
> > > +  They must not acquire any resources that might conflict with the
> > > +  filesystem function being hooked.
> > > +
> > > +- The hook function can abort the inode scan to avoid breaking the other
> > > +  rules.
> > > +
> > > +The inode scan APIs are pretty simple:
> > > +
> > > +- ``xchk_iscan_start`` starts a scan
> > > +
> > > +- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
> > > +  returns zero if there is nothing left to scan
> > > +
> > > +- ``xchk_iscan_want_live_update`` decides if an inode has already been
> > > +  visited in the scan.
> > > +  This is critical for hook functions to decide if they need to update the
> > > +  in-memory scan information.
> > > +
> > > +- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
> > > +  scan
> > > +
> > > +- ``xchk_iscan_finish`` finishes the scan
> > > +
> > > +The proposed patches are at the start of the
> > > +`online quotacheck
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
> > > +series.
> > 
> > Wrong link?  This looks like it goes to the section below.
> 
> Oops.  This one should link to scrub-iscan, and the next one should link
> to repair-quotacheck.
> 
> > > +
> > > +.. _quotacheck:
> > > +
> > > +Case Study: Quota Counter Checking
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +It is useful to compare the mount time quotacheck code to the online
> > > +repair quotacheck code.
> > > +Mount time quotacheck does not have to contend with concurrent operations,
> > > +so it does the following:
> > > +
> > > +1. Make sure the ondisk dquots are in good enough shape that all the
> > > +   incore dquots will actually load, and zero the resource usage counters
> > > +   in the ondisk buffer.
> > > +
> > > +2. Walk every inode in the filesystem.
> > > +   Add each file's resource usage to the incore dquot.
> > > +
> > > +3. Walk each incore dquot.
> > > +   If the incore dquot is not being flushed, add the ondisk buffer backing
> > > +   the incore dquot to a delayed write (delwri) list.
> > > +
> > > +4. Write the buffer list to disk.
> > > +
> > > +Like most online fsck functions, online quotacheck can't write to regular
> > > +filesystem objects until the newly collected metadata reflect all
> > > +filesystem state.
> > > +Therefore, online quotacheck records file resource usage to a shadow dquot
> > > +index implemented with a sparse ``xfarray``, and only writes to the real
> > > +dquots once the scan is complete.
> > > +Handling transactional updates is tricky because quota resource usage
> > > +updates are handled in phases to minimize contention on dquots:
> > > +
> > > +1. The inodes involved are joined and locked to a transaction.
> > > +
> > > +2. For each dquot attached to the file:
> > > +
> > > +   a. The dquot is locked.
> > > +
> > > +   b. A quota reservation is added to the dquot's resource usage.
> > > +      The reservation is recorded in the transaction.
> > > +
> > > +   c. The dquot is unlocked.
> > > +
> > > +3. Changes in actual quota usage are tracked in the transaction.
> > > +
> > > +4. At transaction commit time, each dquot is examined again:
> > > +
> > > +   a. The dquot is locked again.
> > > +
> > > +   b. Quota usage changes are logged and unused reservation is given back
> > > +      to the dquot.
> > > +
> > > +   c. The dquot is unlocked.
> > > +
> > > +For online quotacheck, hooks are placed in steps 2 and 4.
> > > +The step 2 hook creates a shadow version of the transaction dquot context
> > > +(``dqtrx``) that operates in a similar manner to the regular code.
> > > +The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
> > > +Notice that both hooks are called with the inode locked, which is how the
> > > +live update coordinates with the inode scanner.
> > > +
> > > +The quotacheck scan looks like this:
> > > +
> > > +1. Set up a coordinated inode scan.
> > > +
> > > +2. For each inode returned by the inode scan iterator:
> > > +
> > > +   a. Grab and lock the inode.
> > > +
> > > +   b. Determine that inode's resource usage (data blocks, inode counts,
> > > +      realtime blocks)
> > 
> > nit: move this list to the first appearance of "resource usage".  Step
> > 2 of the first list I think
> 
> I don't understand this proposed change.  Are you talking about "2. For
> each dquot attached to the file:" above?  That list describes the steps
> taken by regular code wanting to allocate file space that's accounted to
> quotas.  This list describes what online quotacheck does.  The two don't
> mix.

Oh, you're right, disregard this one

> > > +      and add that to the shadow dquots for the user, group,
> > > +      and project ids associated with the inode.
> > > +
> > > +   c. Unlock and release the inode.
> > > +
> > > +3. For each dquot in the system:
> > > +
> > > +   a. Grab and lock the dquot.
> > > +
> > > +   b. Check the dquot against the shadow dquots created by the scan and
> > > +      updated by the live hooks.
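
If I follow, the commit-time hook would look something like this sketch
-- the xqcheck structure, payload, and helper names are invented for
illustration; only xchk_iscan_want_live_update() comes from the text
above:

	/* Sketch only; struct names and payload layout are invented. */
	static int xqcheck_commit_hook(struct notifier_block *nb,
			unsigned long action, void *data)
	{
		struct xqcheck_update	*p = data;	/* hook payload */
		struct xqcheck		*xqc;		/* scan data */

		xqc = container_of(nb, struct xqcheck, hook.nb);

		/*
		 * Skip files the scanner hasn't reached yet; their usage
		 * will be counted when the scan visits them.
		 */
		if (!xchk_iscan_want_live_update(&xqc->iscan, p->ino))
			return NOTIFY_DONE;

		/* Apply the shadow dqtrx deltas under the scan-data lock. */
		mutex_lock(&xqc->lock);
		xqcheck_update_shadow_dquots(xqc, p);	/* invented helper */
		mutex_unlock(&xqc->lock);
		return NOTIFY_DONE;
	}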
> > > +
> > > +Live updates are key to being able to walk every quota record without
> > > +needing to hold any locks for a long duration.
> > > +If repairs are desired, the real and shadow dquots are locked and their
> > > +resource counts are set to the values in the shadow dquot.
> > > +
> > > +The proposed patchset is the
> > > +`online quotacheck
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
> > 
> > Changed from repair-quota to repair-quotacheck.
> 
> > > +series.
> > > +
> > > +.. _nlinks:
> > > +
> > > +Case Study: File Link Count Checking
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +File link count checking also uses live update hooks.
> > > +The coordinated inode scanner is used to visit all directories on the
> > > +filesystem, and per-file link count records are stored in a sparse
> > > +``xfarray`` indexed by inumber.
> > > +During the scanning phase, each entry in a directory generates observation
> > > +data as follows:
> > > +
> > > +1. If the entry is a dotdot (``'..'``) entry of the root directory, the
> > > +   directory's parent link count is bumped because the root directory's
> > > +   dotdot entry is self referential.
> > > +
> > > +2. If the entry is a dotdot entry of a subdirectory, the parent's backref
> > > +   count is bumped.
> > > +
> > > +3. If the entry is neither a dot nor a dotdot entry, the target file's
> > > +   parent count is bumped.
> > > +
> > > +4. If the target is a subdirectory, the parent's child link count is
> > > +   bumped.
> > > +
> > > +A crucial point to understand about how the link count inode scanner
> > > +interacts with the live update hooks is that the scan cursor tracks which
> > > +*parent* directories have been scanned.
> > > +In other words, the live updates ignore any update about ``A → B`` when A
> > > +has not been scanned, even if B has been scanned.
> > > +Furthermore, a subdirectory A with a dotdot entry pointing back to B is
> > > +accounted as a backref counter in the shadow data for A, since child
> > > +dotdot entries affect the parent's link count.
> > > +Live update hooks are carefully placed in all parts of the filesystem that
> > > +create, change, or remove directory entries, since those operations
> > > +involve bumplink and droplink.
> > > +
> > > +For any file, the correct link count is the number of parents plus the
> > > +number of child subdirectories.
> > > +Non-directories never have children of any kind.
> > > +The backref information is used to detect inconsistencies in the number of
> > > +links pointing to child subdirectories and the number of dotdot entries
> > > +pointing back.
> > > +
> > > +After the scan completes, the link count of each file can be checked by
> > > +locking both the inode and the shadow data, and comparing the link counts.
> > > +A second coordinated inode scan cursor is used for comparisons.
> > > +Live updates are key to being able to walk every inode without needing to
> > > +hold any locks between inodes.
> > > +If repairs are desired, the inode's link count is set to the value in the
> > > +shadow information.
> > > +If no parents are found, the file must be :ref:`reparented <orphanage>` to
> > > +the orphanage to prevent the file from being lost forever.
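
So the post-scan comparison boils down to something like this sketch, if
I read it right?  The shadow record layout here is invented; the
parents-plus-child-subdirectories rule and the backref consistency check
come from the text:

	/* Sketch; the shadow record layout is invented. */
	struct xchk_nlink {
		xfs_nlink_t	parents;	/* dirents pointing here */
		xfs_nlink_t	child_subdirs;	/* subdirectories below */
		xfs_nlink_t	backrefs;	/* dotdot entries seen */
	};

	static bool xchk_nlink_is_ok(struct xfs_inode *ip,
			const struct xchk_nlink *live)
	{
		bool		isdir = S_ISDIR(VFS_I(ip)->i_mode);
		xfs_nlink_t	expected;

		/* Non-directories never have children of any kind. */
		expected = live->parents +
				(isdir ? live->child_subdirs : 0);

		/* Each child's dotdot should match one child link. */
		if (isdir && live->backrefs != live->child_subdirs)
			return false;

		return VFS_I(ip)->i_nlink == expected;
	}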
> > > +
> > > +The proposed patchset is the
> > > +`file link count repair
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
> > > +series.
> > > +
> > > +.. _rmap_repair:
> > > +
> > > +Case Study: Rebuilding Reverse Mapping Records
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Most repair functions follow the same pattern: lock filesystem resources,
> > > +walk the surviving ondisk metadata looking for replacement metadata
> > > +records, and use an :ref:`in-memory array <xfarray>` to store the gathered
> > > +observations.
> > > +The primary advantage of this approach is the simplicity and modularity of
> > > +the repair code -- code and data are entirely contained within the scrub
> > > +module, do not require hooks in the main filesystem, and are usually the
> > > +most efficient in memory use.
> > > +A secondary advantage of this repair approach is atomicity -- once the
> > > +kernel decides a structure is corrupt, no other threads can access the
> > > +metadata until the kernel finishes repairing and revalidating the metadata.
> > > +
> > > +For repairs going on within a shard of the filesystem, these advantages
> > > +outweigh the delays inherent in locking the shard while repairing parts of
> > > +the shard.
> > > +Unfortunately, repairs to the reverse mapping btree cannot use the
> > > +"standard" btree repair strategy because it must scan every space mapping
> > > +of every fork of every file in the filesystem, and the filesystem cannot
> > > +stop.
> > > +Therefore, rmap repair foregoes atomicity between scrub and repair.
> > > +It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update
> > > +hooks <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to
> > > +complete the scan for reverse mapping records.
> > > +
> > > +1. Set up an xfbtree to stage rmap records.
> > > +
> > > +2. While holding the locks on the AGI and AGF buffers acquired during the
> > > +   scrub, generate reverse mappings for all AG metadata: inodes, btrees,
> > > +   CoW staging extents, and the internal log.
> > > +
> > > +3. Set up an inode scanner.
> > > +
> > > +4. Hook into rmap updates for the AG being repaired so that the live scan
> > > +   data can receive updates to the rmap btree from the rest of the
> > > +   filesystem during the file scan.
> > > +
> > > +5. For each space mapping found in either fork of each file scanned,
> > > +   decide if the mapping matches the AG of interest.
> > > +   If so:
> > > +
> > > +   a. Create a btree cursor for the in-memory btree.
> > > +
> > > +   b. Use the rmap code to add the record to the in-memory btree.
> > > +
> > > +   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
> > > +      xfbtree changes to the xfile.
> > > +
> > > +6. For each live update received via the hook, decide if the owner has
> > > +   already been scanned.
> > > +   If so, apply the live update into the scan data:
> > > +
> > > +   a. Create a btree cursor for the in-memory btree.
> > > +
> > > +   b. Replay the operation into the in-memory btree.
> > > +
> > > +   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
> > > +      xfbtree changes to the xfile.
> > > +      This is performed with an empty transaction to avoid changing the
> > > +      caller's state.
> > > +
> > > +7. When the inode scan finishes, create a new scrub transaction and
> > > +   relock the two AG headers.
> > > +
> > > +8. Compute the new btree geometry using the number of rmap records in the
> > > +   shadow btree, like all other btree rebuilding functions.
> > > +
> > > +9. Allocate the number of blocks computed in the previous step.
> > > +
> > > +10. Perform the usual btree bulk loading and commit to install the new
> > > +    rmap btree.
> > > +
> > > +11. Reap the old rmap btree blocks as discussed in the case study about
> > > +    how to :ref:`reap after rmap btree repair <rmap_reap>`.
> > > +
> > > +12. Free the xfbtree now that it is not needed.
> > > +
> > > +The proposed patchset is the
> > > +`rmap repair
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
> > > +series.
> > 
> > Mostly looks good, nits aside.  I do sort of wonder if this patch would
> > do better to appear before patch 6 (or move 6 down), since it gets into
> > more challenges concerning locks and hooks, whereas here we are mostly
> > discussing what they are and how they work.  So it might build better
> > to move this patch up a little.
> 
> (I might be a tad confused here, bear with me.)
> 
> Patch 6, the section about eventual consistency?
> 
> Hmm.  The intent drains exist to quiesce intent chains targeting
> specific AGs.  It briefly mentions "fshooks" in the context of using
> jump labels to avoid the overhead of calling notify_all on the drain
> waitqueue when scrub isn't running.  That's perhaps bad naming on my
> part, since the other "fshooks" are jump labels to avoid bouncing
> through the notifier chain code when scrub isn't running.  The jump
> labels themselves are not hooks, they're structured dynamic code
> patching.
> 
> I probably should've named those something else.  fsgates?

Oh, I see, yes I did sort of try to correlate them, so maybe the
different name would help.

> Or maybe you were talking specifically about "Case Study: Rebuilding
> Reverse Mapping Records"?  In which case I remark that the case study
> needs both the intent drains to quiesce the AG and the live scans to
> work properly, which is why the case study of it couldn't come earlier.
> The intent drains section still ought to come before the refcountbt
> section, because it's the refcountbt scrubber that first hit the
> coordination problem.
> 
> Things are getting pretty awkward like this because there are sooo many
> interdependent pieces. :(

I see, ok no worries then, I think people will figure it out either
way.  I mostly look for ways to make the presentation easier but it is
getting harder to move stuff with chicken-and-egg dependencies.

> 
> Regardless, thank you very much for slogging through.
> 
> --D
> 
> > Allison
> > 