On Wed, Jan 18, 2023 at 12:03:13AM +0000, Allison Henderson wrote: > On Wed, 2023-01-11 at 15:39 -0800, Darrick J. Wong wrote: > > On Wed, Jan 11, 2023 at 01:25:12AM +0000, Allison Henderson wrote: > > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > > > Start the second chapter of the online fsck design documentation. > > > > This covers the general theory underlying how online fsck works. > > > > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > --- > > > > .../filesystems/xfs-online-fsck-design.rst | 366 > > > > ++++++++++++++++++++ > > > > 1 file changed, 366 insertions(+) > > > > > > > > > > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst > > > > b/Documentation/filesystems/xfs-online-fsck-design.rst > > > > index 25717ebb5f80..a03a7b9f0250 100644 > > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst > > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst > > > > @@ -197,3 +197,369 @@ metadata to enable targeted checking and > > > > repair > > > > operations while the system > > > > is running. > > > > This capability will be coupled to automatic system management > > > > so > > > > that > > > > autonomous self-healing of XFS maximizes service availability. > > > > + > > > > +2. Theory of Operation > > > > +====================== > > > > + > > > > +Because it is necessary for online fsck to lock and scan live > > > > metadata objects, > > > > +online fsck consists of three separate code components. > > > > +The first is the userspace driver program ``xfs_scrub``, which > > > > is > > > > responsible > > > > +for identifying individual metadata items, scheduling work items > > > > for > > > > them, > > > > +reacting to the outcomes appropriately, and reporting results to > > > > the > > > > system > > > > +administrator. > > > > +The second and third are in the kernel, which implements > > > > functions > > > > to check > > > > +and repair each type of online fsck work item. > > > > + > > > > ++--------------------------------------------------------------- > > > > ---+ > > > > +| > > > > **Note**: > > > > | > > > > ++--------------------------------------------------------------- > > > > ---+ > > > > +| For brevity, this document shortens the phrase "online fsck > > > > work | > > > > +| item" to "scrub > > > > item". | > > > > ++--------------------------------------------------------------- > > > > ---+ > > > > + > > > > +Scrub item types are delineated in a manner consistent with the > > > > Unix > > > > design > > > > +philosophy, which is to say that each item should handle one > > > > aspect > > > > of a > > > > +metadata structure, and handle it well. > > > > + > > > > +Scope > > > > +----- > > > > + > > > > +In principle, online fsck should be able to check and to repair > > > > everything that > > > > +the offline fsck program can handle. > > > > +However, the adjective *online* brings with it the limitation > > > > that > > > > online fsck > > > > +cannot deal with anything that prevents the filesystem from > > > > going on > > > > line, i.e. > > > > +mounting. > > > Are there really any other operations that do that other than > > > mount? > > > > No. > > > > > I think this reads cleaner: > > > > > > By definition, online fsck can only check and repair an online > > > filesystem. It cannot check mounting operations which start from > > > an > > > offline state. 
> > > > Now that I think about this some more, this whole sentence doesn't > > make > > sense. xfs_scrub can *definitely* detect and fix latent errors that > > would prevent the /next/ mount from succeeding. It's only the fuzz > > test > > suite that stumbles over this, and only because xfs_db cannot fuzz > > mounted filesystems. > > > > "However, online fsck cannot be running 100% of the time, which means > > that latent errors may creep in after a scrub completes. > > If these errors cause the next mount to fail, offline fsck is the > > only > > solution." > Sure, that sounds fair > > > > > > > +This limitation means that maintenance of the offline fsck tool > > > > will > > > > continue. > > > > +A second limitation of online fsck is that it must follow the > > > > same > > > > resource > > > > +sharing and lock acquisition rules as the regular filesystem. > > > > +This means that scrub cannot take *any* shortcuts to save time, > > > > because doing > > > > +so could lead to concurrency problems. > > > > +In other words, online fsck will never be able to fix 100% of > > > > the > > > > +inconsistencies that offline fsck can repair, > > > Hmm, what inconsistencies cannot be repaired as a result of the "no > > > shortcut" rule? I'm all for keeping things short and to the point, > > > but > > > since this section is about scope, I'd give it at least a brief > > > bullet > > > list > > > > Hmm. I can't think of any off the top of my head. Given the > > rewording > > earlier, I think it's more accurate to say: > > > > "In other words, online fsck is not a complete replacement for > > offline > > fsck, and a complete run of online fsck may take longer than offline > > fsck." > That makes sense > > > > > > and a complete run of online fsck > > > > +may take longer. > > > > +However, both of these limitations are acceptable tradeoffs to > > > > satisfy the > > > > +different motivations of online fsck, which are to **minimize > > > > system > > > > downtime** > > > > +and to **increase predictability of operation**. > > > > + > > > > +.. _scrubphases: > > > > + > > > > +Phases of Work > > > > +-------------- > > > > + > > > > +The userspace driver program ``xfs_scrub`` splits the work of > > > > checking and > > > > +repairing an entire filesystem into seven phases. > > > > +Each phase concentrates on checking specific types of scrub > > > > items > > > > and depends > > > > +on the success of all previous phases. > > > > +The seven phases are as follows: > > > > + > > > > +1. Collect geometry information about the mounted filesystem and > > > > computer, > > > > + discover the online fsck capabilities of the kernel, and open > > > > the > > > > + underlying storage devices. > > > > + > > > > +2. Check allocation group metadata, all realtime volume > > > > metadata, > > > > and all quota > > > > + files. > > > > + Each metadata structure is scheduled as a separate scrub > > > > item. > > > Like an intent item? > > > > No, these scrub items are struct scrub_item objects that exist solely > > within the userspace program code. > > > > > > + If corruption is found in the inode header or inode btree and > > > > ``xfs_scrub`` > > > > + is permitted to perform repairs, then those scrub items are > > > > repaired to > > > > + prepare for phase 3. > > > > + Repairs are implemented by resubmitting the scrub item to the > > > > kernel with > > > If I'm understanding this correctly: > > > Repairs are implemented as intent items that are queued and > > > committed > > > just as any filesystem operation.
> > > > > > ? > > I don't want to go too deep into this prematurely, but... > > > > xfs_scrub (the userspace program) needs to track which metadata > > objects > > have been checked and which ones need repairs. The current codebase > > (ab)uses struct xfs_scrub_metadata, but it's very memory inefficient. > > I replaced it with a new struct scrub_item that stores (a) all the > > handle information to identify the inode/AG/rt group/whatever; and > > (b) > > the state of all the checks that can be applied to that item: > > > > struct scrub_item { > > /* > > * Information we need to call the scrub and repair ioctls. > > * Per-AG items should set the ino/gen fields to -1; per- > > inode > > * items should set sri_agno to -1; and per-fs items should > > set > > * all three fields to -1. Or use the macros below. > > */ > > __u64 sri_ino; > > __u32 sri_gen; > > __u32 sri_agno; > > > > /* Bitmask of scrub types that were scheduled here. */ > > __u32 sri_selected; > > > > /* Scrub item state flags, one for each XFS_SCRUB_TYPE. */ > > __u8 sri_state[XFS_SCRUB_TYPE_NR]; > > > > /* Track scrub and repair call retries for each scrub type. > > */ > > __u8 sri_tries[XFS_SCRUB_TYPE_NR]; > > > > /* Were there any corruption repairs needed? */ > > bool sri_inconsistent:1; > > > > /* Are we revalidating after repairs? */ > > bool sri_revalidate:1; > > }; > > > > The first three fields are passed to the kernel via scrub ioctl and > > describe a particular xfs domain (files, AGs, etc). The rest of the > > structure stores state for each type of repair that can be performed > > against that domain. > > > > IOWs, xfs_scrub uses struct scrub_item objects to generate ioctl > > calls > > to the kernel to check and repair things. The kernel reads the ioctl > > information, figures out what needs to be done, and then does the > > usual > > get transaction -> lock things -> make updates -> commit dance to > > make > > corrections to the fs. Those corrections include log intent items, > > but > > there's no tight coupling between log intent items and scrub_items. > > > > Side note: The kernel repair code used to use intents to rebuild a > > structure, but nowadays it uses the btree bulk loader code to replace > > btrees wholesale and in a single atomic commit. Now we use them > > primarily to free preallocated space if the repair fails. > > Oh ok, well how about just: > > "Repairs are implemented by resubmitting the scrub item to the > kernel through a designated ioctl with..." > > ? How about: "Repairs are implemented by using the information in the scrub item to resubmit the kernel scrub call with the repair flag enabled; this is discussed in the next section. Optimizations and all other repairs are deferred to phase 4." ? > > > > > > + the repair flag enabled; this is discussed in the next > > > > section. > > > > + Optimizations and all other repairs are deferred to phase 4. > > > I guess I'll come back to it. > > > > > > > + > > > > +3. Check all metadata of every file in the filesystem. > > > > + Each metadata structure is also scheduled as a separate scrub > > > > item. > > > > + If repairs are needed, ``xfs_scrub`` is permitted to perform > > > > repairs, > > > If repairs are needed and ``xfs_scrub`` is permitted > > > > Fixed. > > > > > ? > > > > + and there were no problems detected during phase 2, then > > > > those > > > > scrub items > > > > + are repaired. > > > > + Optimizations and unsuccessful repairs are deferred to phase > > > > 4. > > > > + > > > > +4.
All remaining repairs and scheduled optimizations are > > > > performed > > > > during this > > > > + phase, if the caller permits them. > > > > + Before starting repairs, the summary counters are checked and > > > > any > > > Did we talk about summary counters yet? Maybe worth a blurb. > > > Otherwise > > > this may not make sense without skipping ahead or into the code > > > > Nope. I'll add that to the previous patch when I introduce primary > > and > > secondary metadata. Good catch! > > > > "Summary metadata, as the name implies, condense information > > contained > > in primary metadata for performance reasons." > > Ok, sounds good then > > > > > > necessary > > > > + repairs are performed so that subsequent repairs will not > > > > fail > > > > the resource > > > > + reservation step due to wildly incorrect summary counters. > > > > + Unsuccessful repairs are requeued as long as forward progress > > > > on > > > > repairs is > > > > + made somewhere in the filesystem. > > > > + Free space in the filesystem is trimmed at the end of phase 4 > > > > if > > > > the > > > > + filesystem is clean. > > > > + > > > > +5. By the start of this phase, all primary and secondary > > > > filesystem > > > > metadata > > > > + must be correct. > > > I think maybe the definitions of primary and secondary metadata > > > should > > > move up before the phases section. Otherwise the reader has to > > > skip > > > ahead to know what that means. > > > > Yep, now primary, secondary, and summary metadata are defined in > > section > > 1. Very good comment. > > > > > > + Summary counters such as the free space counts and quota > > > > resource > > > > counts > > > > + are checked and corrected. > > > > + Directory entry names and extended attribute names are > > > > checked > > > > for > > > > + suspicious entries such as control characters or confusing > > > > Unicode sequences > > > > + appearing in names. > > > > + > > > > +6. If the caller asks for a media scan, read all allocated and > > > > written data > > > > + file extents in the filesystem. > > > > + The ability to use hardware-assisted data file integrity > > > > checking > > > > is new > > > > + to online fsck; neither of the previous tools has this > > > > capability. > > > > + If media errors occur, they will be mapped to the owning > > > > files > > > > and reported. > > > > + > > > > +7. Re-check the summary counters and present the caller with a > > > > summary of > > > > + space usage and file counts. > > > > + > > > > +Steps for Each Scrub Item > > > > +------------------------- > > > > + > > > > +The kernel scrub code uses a three-step strategy for checking > > > > and > > > > repairing > > > > +the one aspect of a metadata object represented by a scrub item: > > > > + > > > > +1. The scrub item of interest is checked for corruptions; opportunities for > > > > + optimization; and for values that are directly controlled by > > > > the > > > > system > > > > + administrator but look suspicious. > > > > + If the item is not corrupt or does not need optimization, > > > > resources are > > > > + released and the positive scan results are returned to > > > > userspace. > > > > + If the item is corrupt or could be optimized but the caller > > > > does > > > > not permit > > > > + this, resources are released and the negative scan results > > > > are > > > > returned to > > > > + userspace. > > > > + Otherwise, the kernel moves on to the second step. > > > > + > > > > +2. The repair function is called to rebuild the data structure.
> > > > + Repair functions generally choose to rebuild a structure from > > > > other > > > > metadata > > > > + rather than try to salvage the existing structure. > > > > + If the repair fails, the scan results from the first step are > > > > returned to > > > > + userspace. > > > > + Otherwise, the kernel moves on to the third step. > > > > + > > > > +3. In the third step, the kernel runs the same checks over the > > > > new > > > > metadata > > > > + item to assess the efficacy of the repairs. > > > > + The results of the reassessment are returned to userspace. > > > > + > > > > +Classification of Metadata > > > > +-------------------------- > > > > + > > > > +Each type of metadata object (and therefore each type of scrub > > > > item) > > > > is > > > > +classified as follows: > > > > + > > > > +Primary Metadata > > > > +```````````````` > > > > + > > > > +Metadata structures in this category should be most familiar to > > > > filesystem > > > > +users either because they are directly created by the user or > > > > they > > > > index > > > > +objects created by the user > > > I think I would just jump straight into a brief list. The above is > > > a > > > bit vague, and documentation that tells you you should already know > > > what it is, doesn't add much. Again, I think too much poetry might > > > be > > > why you're having a hard time getting responses. > > > > Done: > > > > - Free space and reference count information > > > > - Inode records and indexes > > > > - Storage mapping information for file data > > > > - Directories > > > > - Extended attributes > > > > - Symbolic links > > > > - Quota limits > > > > - Link counts > > > > > > > > +Most filesystem objects fall into this class. > > > Most filesystem objects created by users fall into this class, such > > > as > > > inodes, directories, allocation groups and so on. > > > > +Resource and lock acquisition for scrub code follows the same > > > > order > > > > as regular > > > > +filesystem accesses. > > > > > > Lock acquisition for these resources will follow the same order for > > > scrub as a regular filesystem access. > > > > Yes, that is clearer. I think I'll phrase this more actively: > > > > "Scrub obeys the same rules as regular filesystem accesses for > > resource > > and lock acquisition." > > Ok, I think that sounds fine > > > > > > + > > > > +Primary metadata objects are the simplest for scrub to process. > > > > +The principal filesystem object (either an allocation group or > > > > an > > > > inode) that > > > > +owns the item being scrubbed is locked to guard against > > > > concurrent > > > > updates. > > > > +The check function examines every record associated with the > > > > type > > > > for obvious > > > > +errors and cross-references healthy records against other > > > > metadata > > > > to look for > > > > +inconsistencies. > > > > +Repairs for this class of scrub item are simple, since the > > > > repair > > > > function > > > > +starts by holding all the resources acquired in the previous > > > > step. > > > > +The repair function scans available metadata as needed to record > > > > all > > > > the > > > > +observations needed to complete the structure. > > > > +Next, it stages the observations in a new ondisk structure and > > > > commits it > > > > +atomically to complete the repair. > > > > +Finally, the storage from the old data structure is carefully > > > > reaped.
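To make the scan/stage/commit flow described just above a little more concrete, here is a tiny self-contained toy in C. Every name in it is invented for illustration and none of it is actual xfs code. It gathers "observations" into a private in-memory staging array while the damaged index is notionally locked, exposes the rebuilt index in a single switch, and then re-checks the result, which is the same shape as the stage-then-commit repair described in the text:

#include <stdio.h>
#include <stdlib.h>

struct record {
	unsigned long	key;
	unsigned long	value;
};

static int cmp_record(const void *a, const void *b)
{
	const struct record *ra = a, *rb = b;

	if (ra->key < rb->key)
		return -1;
	return ra->key > rb->key;
}

/* The "primary" data from which the damaged index can be rebuilt. */
static struct record source[] = {
	{ .key = 7, .value = 70 },
	{ .key = 2, .value = 20 },
	{ .key = 5, .value = 50 },
};

/* The index currently visible to readers; assume it is corrupt. */
static struct record *live_index;
static size_t live_count;

int main(void)
{
	size_t nr = sizeof(source) / sizeof(source[0]);
	struct record *stage;
	size_t i;

	/* Scan the source records into a private staging array. */
	stage = malloc(nr * sizeof(*stage));
	if (!stage)
		return 1;
	for (i = 0; i < nr; i++)
		stage[i] = source[i];
	qsort(stage, nr, sizeof(*stage), cmp_record);

	/* "Commit": expose the rebuilt index in one switch. */
	live_index = stage;
	live_count = nr;

	/* Re-check the new index before declaring the repair done. */
	for (i = 0; i < live_count; i++)
		printf("%lu -> %lu\n", live_index[i].key, live_index[i].value);
	free(live_index);
	return 0;
}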
> > > > + > > > > +Because ``xfs_scrub`` locks a primary object for the duration of > > > > the > > > > repair, > > > > +this is effectively an offline repair operation performed on a > > > > subset of the > > > > +filesystem. > > > > +This minimizes the complexity of the repair code because it is > > > > not > > > > necessary to > > > > +handle concurrent updates from other threads, nor is it > > > > necessary to > > > > access > > > > +any other part of the filesystem. > > > > +As a result, indexed structures can be rebuilt very quickly, and > > > > programs > > > > +trying to access the damaged structure will be blocked until > > > > repairs > > > > complete. > > > > +The only infrastructure needed by the repair code are the > > > > staging > > > > area for > > > > +observations and a means to write new structures to disk. > > > > +Despite these limitations, the advantage that online repair > > > > holds is > > > > clear: > > > > +targeted work on individual shards of the filesystem avoids > > > > total > > > > loss of > > > > +service. > > > > + > > > > +This mechanism is described in section 2.1 ("Off-Line > > > > Algorithm") of > > > > +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index > > > > Construction > > > > +Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_, > > > Hmm, this article is not displaying for me. If the link is > > > abandoned, > > > probably there's not much need to keep it around > > > > The actual paper is not directly available through that ACM link, but > > the DOI is what I used to track down a paper copy(!) of that paper as > > published in a journal. > > > > (In turn, that journal is "Advances in Database Technology - EDBT > > 1992"; > > I found it in the NYU library. Amazingly, they sold it to me.) > Oh I see. Dave had replied in a separate thread with a pdf version. > That might be a better link so that people do not have to buy a paper > copy. Yep, updated, thanks all! > > > > > > +*Extending Database Technology*, pp. 293-309, 1992. > > > > + > > > > +Most primary metadata repair functions stage their intermediate > > > > results in an > > > > +in-memory array prior to formatting the new ondisk structure, > > > > which > > > > is very > > > > +similar to the list-based algorithm discussed in section 2.3 > > > > ("List- > > > > Based > > > > +Algorithms") of Srinivasan. > > > > +However, any data structure builder that maintains a resource > > > > lock > > > > for the > > > > +duration of the repair is *always* an offline algorithm. > > > > + > > > > +Secondary Metadata > > > > +`````````````````` > > > > + > > > > +Metadata structures in this category reflect records found in > > > > primary metadata, > > > > > > such as rmap and parent pointer attributes. But they are only > > > needed... > > > > > > ? > > Euugh, this section needs some restructuring to get rid of redundant > > sentences. How about: > > > > "Metadata structures in this category reflect records found in > > primary > > metadata, but are only needed for online fsck or for reorganization > > of > > the filesystem. > > > > "Secondary metadata include: > > > > - Reverse mapping information > > > > - Directory parent pointers > > > > "This class of metadata is difficult for scrub to process because > > scrub > > attaches to the secondary object but needs to check primary metadata, > > which runs counter to the usual order of resource acquisition. > > Frequently, this means that full filesystem scans are necessary to > > rebuild the metadata. > > Check functions..."
> > Yes I think that's much clearer :-) > > > > > > > +but are only needed for online fsck or for reorganization of the > > > > filesystem. > > > > +Resource and lock acquisition for scrub code do not follow the > > > > same > > > > order as > > > > +regular filesystem accesses, and may involve full filesystem > > > > scans. > > > > + > > > > +Secondary metadata objects are difficult for scrub to process, > > > > because scrub > > > > +attaches to the secondary object but needs to check primary > > > > metadata, which > > > > +runs counter to the usual order of resource acquisition. > > > bummer :-( > > > > Yup. > > > > > > +Check functions can be limited in scope to reduce runtime. > > > > +Repairs, however, require a full scan of primary metadata, which > > > > can > > > > take a > > > > +long time to complete. > > > > +Under these conditions, ``xfs_scrub`` cannot lock resources for > > > > the > > > > entire > > > > +duration of the repair. > > > > + > > > > +Instead, repair functions set up an in-memory staging structure > > > > to > > > > store > > > > +observations. > > > > +Depending on the requirements of the specific repair function, > > > > the > > > > staging > > > > > > > > > > +index can have the same format as the ondisk structure, or it > > > > can > > > > have a design > > > > +specific to that repair function. > > > ...will have either the same format as the ondisk structure or a > > > structure specific to the repair function. > > > > Fixed. > > > > > > +The next step is to release all locks and start the filesystem > > > > scan. > > > > +When the repair scanner needs to record an observation, the > > > > staging > > > > data are > > > > +locked long enough to apply the update. > > > > +Simultaneously, the repair function hooks relevant parts of the > > > > filesystem to > > > > +apply updates to the staging data if the update pertains to > > > > an > > > > object that > > > > +has already been scanned by the index builder. > > > While a scan is in progress, function hooks are used to apply > > > filesystem updates to both the object and the staging data if the > > > object has already been scanned. > > > > > > ? > > > > The hooks are used to apply updates to the repair staging data, but > > they > > don't apply regular filesystem updates. > > > > The usual process runs something like this: > > > > Lock -> update -> update -> commit > > > > With a scan in progress, say we hook the second update. The > > instruction > > flow becomes: > > > > Lock -> update -> update -> hook -> update staging data -> commit > > > > Maybe something along the following would be better? > > > > "While the filesystem scan is in progress, the repair function hooks > > the > > filesystem so that it can apply pending filesystem updates to the > > staging information." > Ok, that sounds clearer then > > > > > > > +Once the scan is done, the owning object is re-locked, the live > > > > data > > > > is used to > > > > +write a new ondisk structure, and the repairs are committed > > > > atomically. > > > > +The hooks are disabled and the staging area is freed. > > > > +Finally, the storage from the old data structure is carefully > > > > reaped. > > > > + > > > > +Introducing concurrency helps online repair avoid various > > > > locking > > > > problems, but > > > > +comes at a high cost to code complexity. > > > > +Live filesystem code has to be hooked so that the repair > > > > function > > > > can observe > > > > +updates in progress.
> > > > +The staging area has to become a fully functional parallel > > > > structure > > > > so that > > > > +updates can be merged from the hooks. > > > > +Finally, the hook, the filesystem scan, and the inode locking > > > > model > > > > must be > > > > +sufficiently well integrated that a hook event can decide if a > > > > given > > > > update > > > > +should be applied to the staging structure. > > > > + > > > > +In theory, the scrub implementation could apply these same > > > > techniques for > > > > +primary metadata, but doing so would make it massively more > > > > complex > > > > and less > > > > +performant. > > > > +Programs attempting to access the damaged structures are not > > > > blocked > > > > from > > > > +operation, which may cause application failure or an unplanned > > > > filesystem > > > > +shutdown. > > > > + > > > > +Inspiration for the secondary metadata repair strategy was drawn > > > > from section > > > > +2.4 of Srinivasan above, and sections 2 ("NSF: Index Build > > > > Without > > > > Side-File") > > > > +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, > > > > `"Algorithms > > > > for > > > > +Creating Indexes for Very Large Tables Without Quiescing > > > > Updates" > > > > +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992. > > > This one works > > > > > > > + > > > > +The sidecar index mentioned above bears some resemblance to the > > > > side > > > > file > > > > +method mentioned in Srinivasan and Mohan. > > > > +Their method consists of an index builder that extracts relevant > > > > record data to > > > > +build the new structure as quickly as possible; and an auxiliary > > > > structure that > > > > +captures all updates that would be committed to the index by > > > > other > > > > threads were > > > > +the new index already online. > > > > +After the index building scan finishes, the updates recorded in > > > > the > > > > side file > > > > +are applied to the new index. > > > > +To avoid conflicts between the index builder and other writer > > > > threads, the > > > > +builder maintains a publicly visible cursor that tracks the > > > > progress > > > > of the > > > > +scan through the record space. > > > > +To avoid duplication of work between the side file and the index > > > > builder, side > > > > +file updates are elided when the record ID for the update is > > > > greater > > > > than the > > > > +cursor position within the record ID space. > > > > + > > > > +To minimize changes to the rest of the codebase, XFS online > > > > repair > > > > keeps the > > > > +replacement index hidden until it's completely ready to go. > > > > +In other words, there is no attempt to expose the keyspace of > > > > the > > > > new index > > > > +while repair is running. > > > > +The complexity of such an approach would be very high and > > > > perhaps > > > > more > > > > +appropriate to building *new* indices. > > > > + > > > > +**Question**: Can the full scan and live update code used to > > > > facilitate a > > > > +repair also be used to implement a comprehensive check? > > > > + > > > > +*Answer*: Probably, though this has not yet been studied. > > > I kinda feel like discussion Q&As need to be wrapped up before we > > > can > > > call things done. If this is all there was to the answer, then > > > let's > > > clean out the discussion notes.
> > > > Oh, the situation here is worse than that -- in theory, check would > > be > > much stronger if each scrub function employed these live scans to > > build > > a shadow copy of the metadata and then compared the records of both. > > > > However, that pushes the amount of work each scrubber has to do > > much > > higher, and the runtime of those scrubbers would go up. The other > > issue > > is that live scan hooks would have to proliferate through much more > > of > > the filesystem. That's rather more invasive to the codebase than > > most > > of fsck, so I want people to look at the usage models for the handful > > of > > scrubbers that really require it before I spread it around elsewhere. > > Making that kind of change isn't that difficult, but I want to merge > > this stuff before moving on to experimenting with improvements of > > that > > scale. > > I see, well maybe it would be appropriate to just call it a possible > future improvement for now, depending on how the use cases go and if > the demand for it arises. I'll go relabel these as "Future Work Questions". Thanks for continuing through! :) --D > > > > > > + > > > > +Summary Information > > > > +``````````````````` > > > > + > > > Oh, perhaps this section could move up with the other metadata > > > definitions. That way the reader already has an idea of what these > > > terms are referring to before we get into how they are used during > > > the > > > phases. > > > > Yeah, I think/hope this will be less of a problem now that section 1 > > defines all three types of metadata. The start of this section now > > reads: > > > > "Metadata structures in this last category summarize the contents of > > primary metadata records. > > These are often used to speed up resource usage queries, and are many > > times smaller than the primary metadata which they represent. > > > > Examples of summary information include: > > > > - Summary counts of free space and inodes > > > > - File link counts from directories > > > > - Quota resource usage counts > > > > "Check and repair require full filesystem scans, but resource and > > lock > > acquisition follow the same paths as regular filesystem accesses." > Sounds good, I think that will help a lot > > > > > > > +Metadata structures in this last category summarize the contents > > > > of > > > > primary > > > > +metadata records. > > > > +These are often used to speed up resource usage queries, and are > > > > many times > > > > +smaller than the primary metadata which they represent. > > > > +Check and repair both require full filesystem scans, but > > > > resource > > > > and lock > > > > +acquisition follow the same paths as regular filesystem > > > > accesses. > > > > + > > > > +The superblock summary counters have special requirements due to > > > > the > > > > underlying > > > > +implementation of the incore counters, and will be treated > > > > separately. > > > > +Check and repair of the other types of summary counters (quota > > > > resource counts > > > > +and file link counts) employ the same filesystem scanning and > > > > hooking > > > > +techniques as outlined above, but because the underlying data > > > > are > > > > sets of > > > > +integer counters, the staging data need not be a fully > > > > functional > > > > mirror of the > > > > +ondisk structure.
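As a sketch of what "staging data that is only a set of integer counters" can look like -- again, every name below is made up for illustration and this is not actual xfs code -- the toy below keeps one pair of shadow counters per quota id while a scan is notionally running, and lets a commit-time hook fold each transaction's deltas into the shadow table:

#include <stdio.h>
#include <stdint.h>

#define NR_DQUOTS	4

/* Shadow usage counts built up by the quotacheck scan. */
struct shadow_dquot {
	int64_t		bcount;		/* blocks observed so far */
	int64_t		icount;		/* inodes observed so far */
};

static struct shadow_dquot shadow[NR_DQUOTS];

/* One transaction's pending change against a single quota id. */
struct dquot_delta {
	unsigned int	id;
	int64_t		bdelta;
	int64_t		idelta;
};

/* Hook run at transaction commit: fold the delta into the shadow. */
static void commit_hook(const struct dquot_delta *d)
{
	shadow[d->id].bcount += d->bdelta;
	shadow[d->id].icount += d->idelta;
}

int main(void)
{
	/* The scan has charged 100 blocks and 3 inodes to quota id 2. */
	shadow[2].bcount = 100;
	shadow[2].icount = 3;

	/* A concurrent transaction allocates 8 more blocks to id 2. */
	struct dquot_delta d = { .id = 2, .bdelta = 8, .idelta = 0 };
	commit_hook(&d);

	printf("dquot 2: %lld blocks, %lld inodes\n",
	       (long long)shadow[2].bcount, (long long)shadow[2].icount);
	return 0;
}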
> > > > + > > > > +Inspiration for quota and file link count repair strategies was > > > > drawn from > > > > +sections 2.12 ("Online Index Operations") through 2.14 > > > > ("Incremental > > > > View > > > > +Maintenance") of G. Graefe, `"Concurrent Queries and Updates in > > > > Summary Views > > > > +and Their Indexes" > > > > +< > > > > http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf > > > > >` > > > > _, 2011. > > > I wonder if these citations would do better as footnotes? Just to > > > kinda keep the body of the document tidy and flowing well. > > > > Yes, if this were a paginated document. > > > > > > + > > > > +Since quotas are non-negative integer counts of resource usage, > > > > online > > > > +quotacheck can use the incremental view deltas described in > > > > section > > > > 2.14 to > > > > +track pending changes to the block and inode usage counts in > > > > each > > > > transaction, > > > > +and commit those changes to a dquot side file when the > > > > transaction > > > > commits. > > > > +Delta tracking is necessary for dquots because the index builder > > > > scans inodes, > > > > +whereas the data structure being rebuilt is an index of dquots. > > > > +Link count checking combines the view deltas and commit step > > > > into > > > > one because > > > > +it sets attributes of the objects being scanned instead of > > > > writing > > > > them to a > > > > +separate data structure. > > > > +Each online fsck function will be discussed as a case study > > > > later in > > > > this > > > > +document. > > > > + > > > > +Risk Management > > > > +--------------- > > > > + > > > > +During the development of online fsck, several risk factors were > > > > identified > > > > +that may make the feature unsuitable for certain distributors > > > > and > > > > users. > > > > +Steps can be taken to mitigate or eliminate those risks, though > > > > at a > > > > cost to > > > > +functionality. > > > > + > > > > +- **Decreased performance**: Adding metadata indices to the > > > > filesystem > > > > + increases the time cost of persisting changes to disk, and the > > > > reverse space > > > > + mapping and directory parent pointers are no exception. > > > > + System administrators who require the maximum performance can > > > > disable the > > > > + reverse mapping features at format time, though this choice > > > > dramatically > > > > + reduces the ability of online fsck to find inconsistencies and > > > > repair them. > > > > + > > > > +- **Incorrect repairs**: As with all software, there might be > > > > defects in the > > > > + software that result in incorrect repairs being written to the > > > > filesystem. > > > > + Systematic fuzz testing (detailed in the next section) is > > > > employed > > > > by the > > > > + authors to find bugs early, but it might not catch everything. > > > > + The kernel build system provides Kconfig options > > > > (``CONFIG_XFS_ONLINE_SCRUB`` > > > > + and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to > > > > choose > > > > not to > > > > + accept this risk. > > > > + The xfsprogs build system has a configure option (``--enable- > > > > scrub=no``) that > > > > + disables building of the ``xfs_scrub`` binary, though this is > > > > not > > > > a risk > > > > + mitigation if the kernel functionality remains enabled. > > > > + > > > > +- **Inability to repair**: Sometimes, a filesystem is too badly > > > > damaged to be > > > > + repairable.
> > > > + If the keyspaces of several metadata indices overlap in some > > > > manner but a > > > > + coherent narrative cannot be formed from records collected, > > > > then > > > > the repair > > > > + fails. > > > > + To reduce the chance that a repair will fail with a dirty > > > > transaction and > > > > + render the filesystem unusable, the online repair functions > > > > have > > > > been > > > > + designed to stage and validate all new records before > > > > committing > > > > the new > > > > + structure. > > > > + > > > > +- **Misbehavior**: Online fsck requires many privileges -- raw > > > > IO to > > > > block > > > > + devices, opening files by handle, ignoring Unix discretionary > > > > access control, > > > > + and the ability to perform administrative changes. > > > > + Running this automatically in the background scares people, so > > > > the > > > > systemd > > > > + background service is configured to run with only the > > > > privileges > > > > required. > > > > + Obviously, this cannot address certain problems like the > > > > kernel > > > > crashing or > > > > + deadlocking, but it should be sufficient to prevent the scrub > > > > process from > > > > + escaping and reconfiguring the system. > > > > + The cron job does not have this protection. > > > > + > > > > > > I think the fuzz part is one I would consider letting go. All > > > features > > > need to go through a period of stabilizing, and we can't really > > > control > > > how some people respond to it, so I don't think this part adds > > > much. I > > > think the document would do well to be trimmed where it can so as > > > to > > > stay more focused > > > > It took me a minute to realize that this comment applies to the text > > below it. Right? > Yes, sorry for confusion :-) > > > > > > > +- **Fuzz Kiddiez**: There are many people now who seem to think > > > > that > > > > running > > > > + automated fuzz testing of ondisk artifacts to find mischievous > > > > behavior and > > > > + spraying exploit code onto the public mailing list for instant > > > > zero-day > > > > + disclosure is somehow of some social benefit. > > > > I want to keep this bit because it keeps happening[2]. Some folks > > (huawei/alibaba?) have started to try to fix the bugs that their > > robots > > find, and kudos to them! > > > > You might have noticed that Googlers turned their firehose back on > > and > > once again aren't doing anything to fix the problems they find. How > > very Googley of them. > > > > [2] https://lwn.net/Articles/904293/ > > Alrighty then > > > > > > + In the view of this author, the benefit is realized only when > > > > the > > > > fuzz > > > > + operators help to **fix** the flaws, but this opinion > > > > apparently > > > > is not > > > > + widely shared among security "researchers". > > > > + The XFS maintainers' continuing ability to manage these events > > > > presents an > > > > + ongoing risk to the stability of the development process. > > > > + Automated testing should front-load some of the risk while the > > > > feature is > > > > + considered EXPERIMENTAL. > > > > + > > > > +Many of these risks are inherent to software programming. > > > > +Despite this, it is hoped that this new functionality will prove > > > > useful in > > > > +reducing unexpected downtime. > > > > > > > > > > Paraphrasing and reorganizing suggestions aside, I think it looks > > > pretty good > > > > Ok, thank you! > > > > --D > > > > > Allison >