Re: [PATCH 02/14] xfs: document the general theory underlying online fsck design


 



On Wed, Jan 11, 2023 at 01:25:12AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > Start the second chapter of the online fsck design documentation.
> > This covers the general theory underlying how online fsck works.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  366
> > ++++++++++++++++++++
> >  1 file changed, 366 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 25717ebb5f80..a03a7b9f0250 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -197,3 +197,369 @@ metadata to enable targeted checking and repair
> > operations while the system
> >  is running.
> >  This capability will be coupled to automatic system management so
> > that
> >  autonomous self-healing of XFS maximizes service availability.
> > +
> > +2. Theory of Operation
> > +======================
> > +
> > +Because it is necessary for online fsck to lock and scan live
> > metadata objects,
> > +online fsck consists of three separate code components.
> > +The first is the userspace driver program ``xfs_scrub``, which is
> > responsible
> > +for identifying individual metadata items, scheduling work items for
> > them,
> > +reacting to the outcomes appropriately, and reporting results to the
> > system
> > +administrator.
> > +The second and third are in the kernel, which implements functions
> > to check
> > +and repair each type of online fsck work item.
> > +
> > ++------------------------------------------------------------------+
> > +| **Note**:                                                        |
> > ++------------------------------------------------------------------+
> > +| For brevity, this document shortens the phrase "online fsck work |
> > +| item" to "scrub item".                                           |
> > ++------------------------------------------------------------------+
> > +
> > +Scrub item types are delineated in a manner consistent with the Unix
> > design
> > +philosophy, which is to say that each item should handle one aspect
> > of a
> > +metadata structure, and handle it well.
> > +
> > +Scope
> > +-----
> > +
> > +In principle, online fsck should be able to check and to repair
> > everything that
> > +the offline fsck program can handle.
> > +However, the adjective *online* brings with it the limitation that
> > online fsck
> > +cannot deal with anything that prevents the filesystem from going on
> > line, i.e.
> > +mounting.
> Are there really any other operations that do that other than mount?

No.

> I think this reads cleaner:
> 
> By definition, online fsck can only check and repair an online
> filesystem.  It cannot check mounting operations which start from an
> offline state.

Now that I think about this some more, this whole sentence doesn't make
sense.  xfs_scrub can *definitely* detect and fix latent errors that
would prevent the /next/ mount from succeeding.  It's only the fuzz test
suite that stumbles over this, and only because xfs_db cannot fuzz
mounted filesystems.

"However, online fsck cannot be running 100% of the time, which means
that latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution."

> > +This limitation means that maintenance of the offline fsck tool will
> > continue.
> > +A second limitation of online fsck is that it must follow the same
> > resource
> > +sharing and lock acquisition rules as the regular filesystem.
> > +This means that scrub cannot take *any* shortcuts to save time,
> > because doing
> > +so could lead to concurrency problems.
> > +In other words, online fsck will never be able to fix 100% of the
> > +inconsistencies that offline fsck can repair, 
> Hmm, what inconsistencies cannot be repaired as a result of the "no
> shortcut" rule?  I'm all for keeping things short and to the point, but
> since this section is about scope, I'd give it at least a brief bullet
> list

Hmm.  I can't think of any off the top of my head.  Given the rewording
earlier, I think it's more accurate to say:

"In other words, online fsck is not a complete replacement for offline
fsck, and a complete run of online fsck may take longer than offline
fsck."

> > and a complete run of online fsck
> > +may take longer.
> > +However, both of these limitations are acceptable tradeoffs to
> > satisfy the
> > +different motivations of online fsck, which are to **minimize system
> > downtime**
> > +and to **increase predictability of operation**.
> > +
> > +.. _scrubphases:
> > +
> > +Phases of Work
> > +--------------
> > +
> > +The userspace driver program ``xfs_scrub`` splits the work of
> > checking and
> > +repairing an entire filesystem into seven phases.
> > +Each phase concentrates on checking specific types of scrub items
> > and depends
> > +on the success of all previous phases.
> > +The seven phases are as follows:
> > +
> > +1. Collect geometry information about the mounted filesystem and
> > computer,
> > +   discover the online fsck capabilities of the kernel, and open the
> > +   underlying storage devices.
> > +
> > +2. Check allocation group metadata, all realtime volume metadata,
> > and all quota
> > +   files.
> > +   Each metadata structure is scheduled as a separate scrub item.
> Like an intent item?

No, these scrub items are struct scrub_item objects that exist solely
within the userspace program code.

> > +   If corruption is found in the inode header or inode btree and
> > ``xfs_scrub``
> > +   is permitted to perform repairs, then those scrub items are
> > repaired to
> > +   prepare for phase 3.
> > +   Repairs are implemented by resubmitting the scrub item to the
> > kernel with
> If I'm understanding this correctly:
> Repairs are implemented as intent items that are queued and committed
> just as any filesystem operation.
> 
> ?

I don't want to go too deep into this prematurely, but...

xfs_scrub (the userspace program) needs to track which metadata objects
have been checked and which ones need repairs.  The current codebase
(ab)uses struct xfs_scrub_metadata, but it's very memory inefficient.
I replaced it with a new struct scrub_item that stores (a) all the
handle information to identify the inode/AG/rt group/whatever; and (b)
the state of all the checks that can be applied to that item:

struct scrub_item {
	/*
	 * Information we need to call the scrub and repair ioctls.
	 * Per-AG items should set the ino/gen fields to -1; per-inode
	 * items should set sri_agno to -1; and per-fs items should set
	 * all three fields to -1.  Or use the macros below.
	 */
	__u64			sri_ino;
	__u32			sri_gen;
	__u32			sri_agno;

	/* Bitmask of scrub types that were scheduled here. */
	__u32			sri_selected;

	/* Scrub item state flags, one for each XFS_SCRUB_TYPE. */
	__u8			sri_state[XFS_SCRUB_TYPE_NR];

	/* Track scrub and repair call retries for each scrub type. */
	__u8			sri_tries[XFS_SCRUB_TYPE_NR];

	/* Were there any corruption repairs needed? */
	bool			sri_inconsistent:1;

	/* Are we revalidating after repairs? */
	bool			sri_revalidate:1;
};

The first three fields are passed to the kernel via scrub ioctl and
describe a particular xfs domain (files, AGs, etc).  The rest of the
structure stores state for each type of repair that can be performed
against that domain.

IOWs, xfs_scrub uses struct scrub_item objects to generate ioctl calls
to the kernel to check and repair things.  The kernel reads the ioctl
information, figures out what needs to be done, and then does the usual
get transaction -> lock things -> make updates -> commit dance to make
corrections to the fs.  Those corrections include log intent items, but
there's no tight coupling between log intent items and scrub_items.
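
As a rough illustration of the "set the unused handle fields to -1" rule,
here is a standalone userspace sketch.  The sketch_* helpers, the state
values, and the XFS_SCRUB_TYPE_NR placeholder are assumptions made up for
this example; they are not the macros from the patchset, and the __u*
types are swapped for <stdint.h> ones so that it compiles on its own:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Placeholder value; sri_selected is 32 bits wide, so at most 32 types. */
#define XFS_SCRUB_TYPE_NR	32

/* Hypothetical per-type states; the real flag values are not shown here. */
enum sketch_state { SKETCH_IDLE = 0, SKETCH_NEEDS_CHECK, SKETCH_NEEDS_REPAIR };

struct scrub_item {
	uint64_t	sri_ino;
	uint32_t	sri_gen;
	uint32_t	sri_agno;
	uint32_t	sri_selected;
	uint8_t		sri_state[XFS_SCRUB_TYPE_NR];
	uint8_t		sri_tries[XFS_SCRUB_TYPE_NR];
	bool		sri_inconsistent:1;
	bool		sri_revalidate:1;
};

/* Per-AG item: only the AG number is meaningful. */
static void sketch_item_init_ag(struct scrub_item *sri, uint32_t agno)
{
	memset(sri, 0, sizeof(*sri));
	sri->sri_ino = (uint64_t)-1;
	sri->sri_gen = (uint32_t)-1;
	sri->sri_agno = agno;
}

/* Per-inode item: the inode number and generation identify the file. */
static void sketch_item_init_file(struct scrub_item *sri, uint64_t ino,
				  uint32_t gen)
{
	memset(sri, 0, sizeof(*sri));
	sri->sri_ino = ino;
	sri->sri_gen = gen;
	sri->sri_agno = (uint32_t)-1;
}

/* Per-fs item: all three handle fields are -1. */
static void sketch_item_init_fs(struct scrub_item *sri)
{
	memset(sri, 0, sizeof(*sri));
	sri->sri_ino = (uint64_t)-1;
	sri->sri_gen = (uint32_t)-1;
	sri->sri_agno = (uint32_t)-1;
}

/* Mark one scrub type as scheduled against this item. */
static void sketch_item_schedule(struct scrub_item *sri, unsigned int type)
{
	assert(type < XFS_SCRUB_TYPE_NR);
	sri->sri_selected |= (1U << type);
	sri->sri_state[type] = SKETCH_NEEDS_CHECK;
}
```

One item can carry several scheduled scrub types for the same domain,
which is where the memory savings over one struct per check come from.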

Side note: The kernel repair code used to use intents to rebuild a
structure, but nowadays it uses the btree bulk loader code to replace
btrees wholesale and in a single atomic commit.  Now we use intents
primarily to free preallocated space if the repair fails.

> > +   the repair flag enabled; this is discussed in the next section.
> > +   Optimizations and all other repairs are deferred to phase 4.
> I guess I'll come back to it. 
> 
> > +
> > +3. Check all metadata of every file in the filesystem.
> > +   Each metadata structure is also scheduled as a separate scrub
> > item.
> > +   If repairs are needed, ``xfs_scrub`` is permitted to perform
> > repairs,
> If repairs are needed and ``xfs_scrub`` is permitted

Fixed.

> ?
> > +   and there were no problems detected during phase 2, then those
> > scrub items
> > +   are repaired.
> > +   Optimizations and unsuccessful repairs are deferred to phase 4.
> > +
> > +4. All remaining repairs and scheduled optimizations are performed
> > during this
> > +   phase, if the caller permits them.
> > +   Before starting repairs, the summary counters are checked and any
> Did we talk about summary counters yet?  Maybe worth a blurb. Otherwise
> this may not make sense without skipping ahead or into the code

Nope.  I'll add that to the previous patch when I introduce primary and
secondary metadata.  Good catch!

"Summary metadata, as the name implies, condense information contained
in primary metadata for performance reasons."

> > necessary
> > +   repairs are performed so that subsequent repairs will not fail
> > the resource
> > +   reservation step due to wildly incorrect summary counters.
> > +   Unsuccessful repairs are requeued as long as forward progress on
> > repairs is
> > +   made somewhere in the filesystem.
> > +   Free space in the filesystem is trimmed at the end of phase 4 if
> > the
> > +   filesystem is clean.
> > +
> > +5. By the start of this phase, all primary and secondary filesystem
> > metadata
> > +   must be correct.
> I think maybe the definitions of primary and secondary metadata should
> move up before the phases section.  Otherwise the reader has to skip
> ahead to know what that means.

Yep, now primary, secondary, and summary metadata are defined in section
1.  Very good comment.

> > +   Summary counters such as the free space counts and quota resource
> > counts
> > +   are checked and corrected.
> > +   Directory entry names and extended attribute names are checked
> > for
> > +   suspicious entries such as control characters or confusing
> > Unicode sequences
> > +   appearing in names.
> > +
> > +6. If the caller asks for a media scan, read all allocated and
> > written data
> > +   file extents in the filesystem.
> > +   The ability to use hardware-assisted data file integrity checking
> > is new
> > +   to online fsck; neither of the previous tools have this
> > capability.
> > +   If media errors occur, they will be mapped to the owning files
> > and reported.
> > +
> > +7. Re-check the summary counters and present the caller with a
> > summary of
> > +   space usage and file counts.
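
The phase 4 requeue rule ("keep retrying as long as something got fixed
somewhere") can be sketched roughly as below.  This is only a model:
the function names, the callback signature, and the toy repair callback
are hypothetical and are not taken from xfs_scrub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if the repair succeeded. */
typedef bool (*sketch_repair_fn)(int item, void *priv);

/*
 * Retry the pending items until a whole pass fixes nothing.
 * Items that fail are compacted to the front of the array (requeued).
 * Returns the number of items still unrepaired.
 */
static size_t sketch_repair_until_stuck(int *pending, size_t nr,
					sketch_repair_fn repair, void *priv)
{
	bool progress = true;

	assert(repair != NULL);
	while (nr > 0 && progress) {
		size_t kept = 0;

		progress = false;
		for (size_t i = 0; i < nr; i++) {
			if (repair(pending[i], priv))
				progress = true;		/* fixed; drop it */
			else
				pending[kept++] = pending[i];	/* requeue */
		}
		nr = kept;
	}
	return nr;
}

/*
 * Toy callback modelling dependencies between repairs: item N can only
 * be fixed once at least N other repairs have succeeded.  priv points
 * to the running count of successful repairs.
 */
static bool sketch_toy_repair(int item, void *priv)
{
	int *fixed = priv;

	if (item > *fixed)
		return false;
	(*fixed)++;
	return true;
}
```

The loop stops once a full pass over the queue repairs nothing, so a
repair that can never succeed does not keep the phase spinning.
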
> > +
> > +Steps for Each Scrub Item
> > +-------------------------
> > +
> > +The kernel scrub code uses a three-step strategy for checking and
> > repairing
> > +the one aspect of a metadata object represented by a scrub item:
> > +
> > +1. The scrub item of interest is checked for corruptions; opportunities for
> > +   optimization; and for values that are directly controlled by the
> > system
> > +   administrator but look suspicious.
> > +   If the item is not corrupt or does not need optimization,
> > resources are
> > +   released and the positive scan results are returned to userspace.
> > +   If the item is corrupt or could be optimized but the caller does
> > not permit
> > +   this, resources are released and the negative scan results are
> > returned to
> > +   userspace.
> > +   Otherwise, the kernel moves on to the second step.
> > +
> > +2. The repair function is called to rebuild the data structure.
> > +   Repair functions generally choose to rebuild a structure from other
> > metadata
> > +   rather than try to salvage the existing structure.
> > +   If the repair fails, the scan results from the first step are
> > returned to
> > +   userspace.
> > +   Otherwise, the kernel moves on to the third step.
> > +
> > +3. In the third step, the kernel runs the same checks over the new
> > metadata
> > +   item to assess the efficacy of the repairs.
> > +   The results of the reassessment are returned to userspace.
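
The three steps above boil down to a control flow along these lines.
This is a sketch, not the kernel code: struct sketch_result,
struct sketch_ops, and the toy check/repair functions are invented for
the example, and resource acquisition and release are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of a check; not the kernel's actual flag set. */
struct sketch_result {
	bool	corrupt;
	bool	could_optimize;
};

struct sketch_ops {
	struct sketch_result	(*check)(void *obj);
	bool			(*repair)(void *obj);	/* true on success */
};

static struct sketch_result
sketch_scrub_one(const struct sketch_ops *ops, void *obj, bool may_repair)
{
	/* Step 1: check the item. */
	struct sketch_result res = ops->check(obj);

	if (!res.corrupt && !res.could_optimize)
		return res;	/* clean: report the positive results */
	if (!may_repair)
		return res;	/* caller did not permit changes */

	/* Step 2: rebuild; on failure, report what step 1 found. */
	if (!ops->repair(obj))
		return res;

	/* Step 3: run the same checks over the new structure. */
	return ops->check(obj);
}

/* Toy object for demonstration: an int that is "corrupt" when negative. */
static struct sketch_result sketch_toy_check(void *obj)
{
	struct sketch_result res = { .corrupt = *(int *)obj < 0 };

	return res;
}

static bool sketch_toy_repair_obj(void *obj)
{
	*(int *)obj = 0;
	return true;
}
```

Note that the caller only ever sees one set of results: either the
original findings or the findings of the post-repair recheck.
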
> > +
> > +Classification of Metadata
> > +--------------------------
> > +
> > +Each type of metadata object (and therefore each type of scrub item)
> > is
> > +classified as follows:
> > +
> > +Primary Metadata
> > +````````````````
> > +
> > +Metadata structures in this category should be most familiar to
> > filesystem
> > +users either because they are directly created by the user or they
> > index
> > +objects created by the user
> I think I would just jump straight into a brief list.  The above is a
> bit vague, and documentation that tells you you should already know
> what it is, doesnt add much.  Again, I think too much poetry might be
> why you're having a hard time getting responses.

Done:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

- Link counts


> > +Most filesystem objects fall into this class.
> Most filesystem objects created by users fall into this class, such as
> inode, directories, allocation groups and so on.
> > +Resource and lock acquisition for scrub code follows the same order
> > as regular
> > +filesystem accesses.
> 
> Lock acquisition for these resources will follow the same order for
> scrub as a regular filesystem access.

Yes, that is clearer.  I think I'll phrase this more actively:

"Scrub obeys the same rules as regular filesystem accesses for resource
and lock acquisition."

> > +
> > +Primary metadata objects are the simplest for scrub to process.
> > +The principal filesystem object (either an allocation group or an
> > inode) that
> > +owns the item being scrubbed is locked to guard against concurrent
> > updates.
> > +The check function examines every record associated with the type
> > for obvious
> > +errors and cross-references healthy records against other metadata
> > to look for
> > +inconsistencies.
> > +Repairs for this class of scrub item are simple, since the repair
> > function
> > +starts by holding all the resources acquired in the previous step.
> > +The repair function scans available metadata as needed to record all
> > the
> > +observations needed to complete the structure.
> > +Next, it stages the observations in a new ondisk structure and
> > commits it
> > +atomically to complete the repair.
> > +Finally, the storage from the old data structure is carefully
> > reaped.
> > +
> > +Because ``xfs_scrub`` locks a primary object for the duration of the
> > repair,
> > +this is effectively an offline repair operation performed on a
> > subset of the
> > +filesystem.
> > +This minimizes the complexity of the repair code because it is not
> > necessary to
> > +handle concurrent updates from other threads, nor is it necessary to
> > access
> > +any other part of the filesystem.
> > +As a result, indexed structures can be rebuilt very quickly, and
> > programs
> > +trying to access the damaged structure will be blocked until repairs
> > complete.
> > +The only infrastructure needed by the repair code are the staging
> > area for
> > +observations and a means to write new structures to disk.
> > +Despite these limitations, the advantage that online repair holds is
> > clear:
> > +targeted work on individual shards of the filesystem avoids total
> > loss of
> > +service.
> > +
> > +This mechanism is described in section 2.1 ("Off-Line Algorithm") of
> > +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index
> > Construction
> > +Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_,
> Hmm, this article is not displaying for me.  If the link is abandoned,
> probably there's not much need to keep it around

The actual paper is not directly available through that ACM link, but
the DOI is what I used to track down a paper copy(!) of that paper as
published in a journal.

(In turn, that journal is "Advances in Database Technology - EDBT 1992";
I found it in the NYU library.  Amazingly, they sold it to me.)

> > +*Extending Database Technology*, pp. 293-309, 1992.
> > +
> > +Most primary metadata repair functions stage their intermediate
> > results in an
> > +in-memory array prior to formatting the new ondisk structure, which
> > is very
> > +similar to the list-based algorithm discussed in section 2.3 ("List-
> > Based
> > +Algorithms") of Srinivasan.
> > +However, any data structure builder that maintains a resource lock
> > for the
> > +duration of the repair is *always* an offline algorithm.
> > +
> > +Secondary Metadata
> > +``````````````````
> > +
> > +Metadata structures in this category reflect records found in
> > primary metadata,
> 
> such as rmap and parent pointer attributes.  But they are only
> needed...
> 
> ?

Euugh, this section needs some restructuring to get rid of redundant
sentences.  How about:

"Metadata structures in this category reflect records found in primary
metadata, but are only needed for online fsck or for reorganization of
the filesystem.

"Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

"This class of metadata is difficult for scrub to process because scrub
attaches to the secondary object but needs to check primary metadata,
which runs counter to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to
rebuild the metadata.
Check functions..."

> > +but are only needed for online fsck or for reorganization of the
> > filesystem.
> > +Resource and lock acquisition for scrub code do not follow the same
> > order as
> > +regular filesystem accesses, and may involve full filesystem scans.
> > +
> > +Secondary metadata objects are difficult for scrub to process,
> > because scrub
> > +attaches to the secondary object but needs to check primary
> > metadata, which
> > +runs counter to the usual order of resource acquisition.
> bummer :-(

Yup.

> > +Check functions can be limited in scope to reduce runtime.
> > +Repairs, however, require a full scan of primary metadata, which can
> > take a
> > +long time to complete.
> > +Under these conditions, ``xfs_scrub`` cannot lock resources for the
> > entire
> > +duration of the repair.
> > +
> > +Instead, repair functions set up an in-memory staging structure to
> > store
> > +observations.
> > +Depending on the requirements of the specific repair function, the
> > staging
> 
> 
> > +index can have the same format as the ondisk structure, or it can
> > have a design
> > +specific to that repair function.
> ...will have either the same format as the ondisk structure or a
> structure specific to the repair function.

Fixed.

> > +The next step is to release all locks and start the filesystem scan.
> > +When the repair scanner needs to record an observation, the staging
> > data are
> > +locked long enough to apply the update.
> > +Simultaneously, the repair function hooks relevant parts of the
> > filesystem to
> > +apply updates to the staging data if the update pertains to an
> > object that
> > +has already been scanned by the index builder.
> While a scan is in progress, function hooks are used to apply
> filesystem updates to both the object and the staging data if the
> object has already been scanned.
> 
> ?

The hooks are used to apply updates to the repair staging data, but they
don't apply regular filesystem updates.

The usual process runs something like this:

  Lock -> update -> update -> commit

With a scan in progress, say we hook the second update.  The instruction
flow becomes:

  Lock -> update -> update -> hook -> update staging data -> commit

Maybe something along the following would be better?

"While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the
staging information."
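
In toy form, the "update -> hook -> update staging data" flow looks
something like the sketch below.  Everything here is a stand-in: the
per-object integer counters model the staging data, the sketch_* names
are hypothetical, and the locking around the staging data is omitted to
keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_OBJS	8

/* Hypothetical repair context: a shadow copy plus a scan cursor. */
struct sketch_repair {
	uint64_t	next_to_scan;	/* objects below this were scanned */
	int64_t		staging[SKETCH_NR_OBJS];
};

/* Live data plus an optional hook; NULL means no repair is running. */
struct sketch_fs {
	int64_t			live[SKETCH_NR_OBJS];
	struct sketch_repair	*hook;
};

/* Hook body: only objects the scan already visited need the delta. */
static void sketch_hook(struct sketch_repair *rp, uint64_t obj, int64_t delta)
{
	if (obj < rp->next_to_scan)
		rp->staging[obj] += delta;
	/* Otherwise the scan will see the new live value when it arrives. */
}

/* Regular update path: update -> hook -> update staging data. */
static void sketch_fs_update(struct sketch_fs *fs, uint64_t obj, int64_t delta)
{
	assert(obj < SKETCH_NR_OBJS);
	fs->live[obj] += delta;
	if (fs->hook)
		sketch_hook(fs->hook, obj, delta);
}

/* Scanner: record one observation and advance the cursor. */
static void sketch_scan_one(struct sketch_fs *fs, struct sketch_repair *rp)
{
	assert(rp->next_to_scan < SKETCH_NR_OBJS);
	rp->staging[rp->next_to_scan] = fs->live[rp->next_to_scan];
	rp->next_to_scan++;
}
```

The hook never touches the live data itself; the regular update has
already happened by the time the hook runs.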

> > +Once the scan is done, the owning object is re-locked, the live data
> > is used to
> > +write a new ondisk structure, and the repairs are committed
> > atomically.
> > +The hooks are disabled and the staging area is freed.
> > +Finally, the storage from the old data structure is carefully
> > reaped.
> > +
> > +Introducing concurrency helps online repair avoid various locking
> > problems, but
> > +comes at a high cost to code complexity.
> > +Live filesystem code has to be hooked so that the repair function
> > can observe
> > +updates in progress.
> > +The staging area has to become a fully functional parallel structure
> > so that
> > +updates can be merged from the hooks.
> > +Finally, the hook, the filesystem scan, and the inode locking model
> > must be
> > +sufficiently well integrated that a hook event can decide if a given
> > update
> > +should be applied to the staging structure.
> > +
> > +In theory, the scrub implementation could apply these same
> > techniques for
> > +primary metadata, but doing so would make it massively more complex
> > and less
> > +performant.
> > +Programs attempting to access the damaged structures are not blocked
> > from
> > +operation, which may cause application failure or an unplanned
> > filesystem
> > +shutdown.
> > +
> > +Inspiration for the secondary metadata repair strategy was drawn
> > from section
> > +2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without
> > Side-File")
> > +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms
> > for
> > +Creating Indexes for Very Large Tables Without Quiescing Updates"
> > +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
> This one works
> 
> > +
> > +The sidecar index mentioned above bears some resemblance to the side
> > file
> > +method mentioned in Srinivasan and Mohan.
> > +Their method consists of an index builder that extracts relevant
> > record data to
> > +build the new structure as quickly as possible; and an auxiliary
> > structure that
> > +captures all updates that would be committed to the index by other
> > threads were
> > +the new index already online.
> > +After the index building scan finishes, the updates recorded in the
> > side file
> > +are applied to the new index.
> > +To avoid conflicts between the index builder and other writer
> > threads, the
> > +builder maintains a publicly visible cursor that tracks the progress
> > of the
> > +scan through the record space.
> > +To avoid duplication of work between the side file and the index
> > builder, side
> > +file updates are elided when the record ID for the update is greater
> > than the
> > +cursor position within the record ID space.
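
For comparison, the side file method as summarized in the paragraph
above could be modelled like this.  It is a sketch of the description
in this document, not of the papers' actual algorithms; the structure
layout, the fixed-size log, and the sketch_* names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SIDE_MAX	16

struct sketch_side_update {
	uint64_t	rec_id;
	int64_t		delta;
};

struct sketch_builder {
	uint64_t			cursor;	/* last record ID scanned */
	struct sketch_side_update	side[SKETCH_SIDE_MAX];
	size_t				nr_side;
};

/* Called by writers; returns true if the update went to the side file. */
static bool sketch_side_log(struct sketch_builder *b, uint64_t rec_id,
			    int64_t delta)
{
	/* Elide: the builder has not reached this record yet. */
	if (rec_id > b->cursor)
		return false;

	assert(b->nr_side < SKETCH_SIDE_MAX);
	b->side[b->nr_side].rec_id = rec_id;
	b->side[b->nr_side].delta = delta;
	b->nr_side++;
	return true;
}

/* After the scan finishes: replay the side file into the new index. */
static void sketch_side_replay(const struct sketch_builder *b,
			       int64_t *new_index)
{
	for (size_t i = 0; i < b->nr_side; i++)
		new_index[b->side[i].rec_id] += b->side[i].delta;
}
```

The difference from the hook approach is that updates are logged and
replayed after the scan instead of being merged into the staging data
as they happen.
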
> > +
> > +To minimize changes to the rest of the codebase, XFS online repair
> > keeps the
> > +replacement index hidden until it's completely ready to go.
> > +In other words, there is no attempt to expose the keyspace of the
> > new index
> > +while repair is running.
> > +The complexity of such an approach would be very high and perhaps
> > more
> > +appropriate to building *new* indices.
> > +
> > +**Question**: Can the full scan and live update code used to
> > facilitate a
> > +repair also be used to implement a comprehensive check?
> > +
> > +*Answer*: Probably, though this has not yet been studied.
> I kinda feel like discussion Q&As need to be wrapped up before we can
> call things done.  If this is all there was to the answer, then lets
> clean out the discussion notes.

Oh, the situation here is worse than that -- in theory, check would be
much stronger if each scrub function employed these live scans to build
a shadow copy of the metadata and then compared the records of both.

However, that greatly increases the amount of work each scrubber has to
do, and the runtime of those scrubbers would go up.  The other issue
is that live scan hooks would have to proliferate through much more of
the filesystem.  That's rather more invasive to the codebase than most
of fsck, so I want people to look at the usage models for the handful of
scrubbers that really require it before I spread it around elsewhere.
Making that kind of change isn't that difficult, but I want to merge
this stuff before moving on to experimenting with improvements of that
scale.

> > +
> > +Summary Information
> > +```````````````````
> > +
> Oh, perhaps this section could move up with the other metadata
> definitions.  That way the reader already has an idea of what these
> terms are referring to before we get into how they are used during the
> phases.

Yeah, I think/hope this will be less of a problem now that section 1
defines all three types of metadata.  The start of this section now
reads:

"Metadata structures in this last category summarize the contents of
primary metadata records.
These are often used to speed up resource usage queries, and are many
times smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

"Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses."

> > +Metadata structures in this last category summarize the contents of
> > primary
> > +metadata records.
> > +These are often used to speed up resource usage queries, and are
> > many times
> > +smaller than the primary metadata which they represent.
> > +Check and repair both require full filesystem scans, but resource
> > and lock
> > +acquisition follow the same paths as regular filesystem accesses.
> > +
> > +The superblock summary counters have special requirements due to the
> > underlying
> > +implementation of the incore counters, and will be treated
> > separately.
> > +Check and repair of the other types of summary counters (quota
> > resource counts
> > +and file link counts) employ the same filesystem scanning and
> > hooking
> > +techniques as outlined above, but because the underlying data are
> > sets of
> > +integer counters, the staging data need not be a fully functional
> > mirror of the
> > +ondisk structure.
> > +
> > +Inspiration for quota and file link count repair strategies was
> > drawn from
> > +sections 2.12 ("Online Index Operations") through 2.14 ("Incremental
> > View
> > +Maintenance") of G. Graefe, `"Concurrent Queries and Updates in
> > Summary Views
> > +and Their Indexes"
> > +<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
> I wonder if these citations would do better as foot notes?  Just to
> kinda keep the body of the document tidy and flowing well.

Yes, if this were a paginated document.

> > +
> > +Since quotas are non-negative integer counts of resource usage,
> > online
> > +quotacheck can use the incremental view deltas described in section
> > 2.14 to
> > +track pending changes to the block and inode usage counts in each
> > transaction,
> > +and commit those changes to a dquot side file when the transaction
> > commits.
> > +Delta tracking is necessary for dquots because the index builder
> > scans inodes,
> > +whereas the data structure being rebuilt is an index of dquots.
> > +Link count checking combines the view deltas and commit step into
> > one because
> > +it sets attributes of the objects being scanned instead of writing
> > them to a
> > +separate data structure.
> > +Each online fsck function will be discussed as case studies later in
> > this
> > +document.
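
The per-transaction delta tracking described above can be modelled as
follows.  This is a simplified sketch: the sketch_* types and functions
are hypothetical, the quota ID space is a tiny fixed array, and the
"has this inode been scanned yet" test from the live scan is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NR_IDS	4	/* hypothetical number of quota IDs */

struct sketch_usage {
	int64_t	blocks;
	int64_t	inodes;
};

/* Shadow dquot table being rebuilt by the scan (the "side file"). */
struct sketch_shadow {
	struct sketch_usage	ids[SKETCH_NR_IDS];
};

/* Pending usage deltas attached to one transaction. */
struct sketch_tx {
	struct sketch_usage	delta[SKETCH_NR_IDS];
};

static void sketch_tx_begin(struct sketch_tx *tx)
{
	memset(tx, 0, sizeof(*tx));
}

/* Record a usage change; nothing is visible in the shadow table yet. */
static void sketch_tx_mod(struct sketch_tx *tx, unsigned int id,
			  int64_t dblocks, int64_t dinodes)
{
	assert(id < SKETCH_NR_IDS);
	tx->delta[id].blocks += dblocks;
	tx->delta[id].inodes += dinodes;
}

/* Commit: fold the pending deltas into the shadow table. */
static void sketch_tx_commit(struct sketch_tx *tx, struct sketch_shadow *sh)
{
	for (unsigned int id = 0; id < SKETCH_NR_IDS; id++) {
		sh->ids[id].blocks += tx->delta[id].blocks;
		sh->ids[id].inodes += tx->delta[id].inodes;
	}
	memset(tx, 0, sizeof(*tx));
}

/* Cancel: the deltas are dropped and the shadow table is untouched. */
static void sketch_tx_cancel(struct sketch_tx *tx)
{
	memset(tx, 0, sizeof(*tx));
}
```

Keeping the deltas with the transaction means a cancelled transaction
never leaks a partial update into the shadow counts.
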
> > +
> > +Risk Management
> > +---------------
> > +
> > +During the development of online fsck, several risk factors were
> > identified
> > +that may make the feature unsuitable for certain distributors and
> > users.
> > +Steps can be taken to mitigate or eliminate those risks, though at a
> > cost to
> > +functionality.
> > +
> > +- **Decreased performance**: Adding metadata indices to the
> > filesystem
> > +  increases the time cost of persisting changes to disk, and the
> > reverse space
> > +  mapping and directory parent pointers are no exception.
> > +  System administrators who require the maximum performance can
> > disable the
> > +  reverse mapping features at format time, though this choice
> > dramatically
> > +  reduces the ability of online fsck to find inconsistencies and
> > repair them.
> > +
> > +- **Incorrect repairs**: As with all software, there might be
> > defects in the
> > +  software that result in incorrect repairs being written to the
> > filesystem.
> > +  Systematic fuzz testing (detailed in the next section) is employed
> > by the
> > +  authors to find bugs early, but it might not catch everything.
> > +  The kernel build system provides Kconfig options
> > (``CONFIG_XFS_ONLINE_SCRUB``
> > +  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose
> > not to
> > +  accept this risk.
> > +  The xfsprogs build system has a configure option (``--enable-
> > scrub=no``) that
> > +  disables building of the ``xfs_scrub`` binary, though this is not
> > a risk
> > +  mitigation if the kernel functionality remains enabled.
> > +
> > +- **Inability to repair**: Sometimes, a filesystem is too badly
> > damaged to be
> > +  repairable.
> > +  If the keyspaces of several metadata indices overlap in some
> > manner but a
> > +  coherent narrative cannot be formed from records collected, then
> > the repair
> > +  fails.
> > +  To reduce the chance that a repair will fail with a dirty
> > transaction and
> > +  render the filesystem unusable, the online repair functions have
> > been
> > +  designed to stage and validate all new records before committing
> > the new
> > +  structure.
> > +
> > +- **Misbehavior**: Online fsck requires many privileges -- raw IO to
> > block
> > +  devices, opening files by handle, ignoring Unix discretionary
> > access control,
> > +  and the ability to perform administrative changes.
> > +  Running this automatically in the background scares people, so the
> > systemd
> > +  background service is configured to run with only the privileges
> > required.
> > +  Obviously, this cannot address certain problems like the kernel
> > crashing or
> > +  deadlocking, but it should be sufficient to prevent the scrub
> > process from
> > +  escaping and reconfiguring the system.
> > +  The cron job does not have this protection.
> > +
> 
> I think the fuzz part is one I would consider letting go.  All features
> need to go through a period of stabilizing, and we cant really control
> how some people respond to it, so I don't think this part adds much.  I
> think the document would do well to be trimmed where it can so as to
> stay more focused 

It took me a minute to realize that this comment applies to the text
below it.  Right?

> > +- **Fuzz Kiddiez**: There are many people now who seem to think that
> > running
> > +  automated fuzz testing of ondisk artifacts to find mischievous
> > behavior and
> > +  spraying exploit code onto the public mailing list for instant
> > zero-day
> > +  disclosure is somehow of some social benefit.

I want to keep this bit because it keeps happening[2].  Some folks
(huawei/alibaba?) have started to try to fix the bugs that their robots
find, and kudos to them!

You might have noticed that Googlers turned their firehose back on and
once again aren't doing anything to fix the problems they find.  How
very Googley of them.

[2] https://lwn.net/Articles/904293/

> > +  In the view of this author, the benefit is realized only when the
> > fuzz
> > +  operators help to **fix** the flaws, but this opinion apparently
> > is not
> > +  widely shared among security "researchers".
> > +  The XFS maintainers' continuing ability to manage these events
> > presents an
> > +  ongoing risk to the stability of the development process.
> > +  Automated testing should front-load some of the risk while the
> > feature is
> > +  considered EXPERIMENTAL.
> > +
> > +Many of these risks are inherent to software programming.
> > +Despite this, it is hoped that this new functionality will prove
> > useful in
> > +reducing unexpected downtime.
> > 
> 
> Paraphrasing and reorganizing suggestions aside, I think it looks
> pretty good

Ok, thank you!

--D

> Allison


