Re: [PATCH 01/14] xfs: document the motivation for online fsck design

Allison Henderson <allison.henderson@xxxxxxxxxx> · Sat, 7 Jan 2023 05:01:54 +0000

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> Start the first chapter of the online fsck design documentation.
> This covers the motivations for creating this in the first place.
> 
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  Documentation/filesystems/index.rst                |    1 
>  .../filesystems/xfs-online-fsck-design.rst         |  199
> ++++++++++++++++++++
>  2 files changed, 200 insertions(+)
>  create mode 100644 Documentation/filesystems/xfs-online-fsck-
> design.rst
> 
> 
> diff --git a/Documentation/filesystems/index.rst
> b/Documentation/filesystems/index.rst
> index bee63d42e5ec..fbb2b5ada95b 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -123,4 +123,5 @@ Documentation for filesystem implementations.
>     vfat
>     xfs-delayed-logging-design
>     xfs-self-describing-metadata
> +   xfs-online-fsck-design
>     zonefs
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> new file mode 100644
> index 000000000000..25717ebb5f80
> --- /dev/null
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -0,0 +1,199 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. _xfs_online_fsck_design:
> +
> +..
> +        Mapping of heading styles within this document:
> +        Heading 1 uses "====" above and below
> +        Heading 2 uses "===="
> +        Heading 3 uses "----"
> +        Heading 4 uses "````"
> +        Heading 5 uses "^^^^"
> +        Heading 6 uses "~~~~"
> +        Heading 7 uses "...."
> +
> +        Sections are manually numbered because apparently that's
> what everyone
> +        does in the kernel.
> +
> +======================
> +XFS Online Fsck Design
> +======================
> +
> +This document captures the design of the online filesystem check
> feature for
> +XFS.
> +The purpose of this document is threefold:
> +
> +- To help kernel distributors understand exactly what the XFS online
> fsck
> +  feature is, and issues about which they should be aware.
> +
> +- To help people reading the code to familiarize themselves with the
> relevant
> +  concepts and design points before they start digging into the
> code.
> +
> +- To help developers maintaining the system by capturing the reasons
> +  supporting higher level decisionmaking.
nit: decision making

> +
> +As the online fsck code is merged, the links in this document to
> topic branches
> +will be replaced with links to code.
> +
> +This document is licensed under the terms of the GNU Public License,
> v2.
> +The primary author is Darrick J. Wong.
> +
> +This design document is split into seven parts.
> +Part 1 defines what fsck tools are and the motivations for writing a
> new one.
> +Parts 2 and 3 present a high level overview of how online fsck
> process works
> +and how it is tested to ensure correct functionality.
> +Part 4 discusses the user interface and the intended usage modes of
> the new
> +program.
> +Parts 5 and 6 show off the high level components and how they fit
> together, and
> +then present case studies of how each repair function actually
> works.
> +Part 7 sums up what has been discussed so far and speculates about
> what else
> +might be built atop online fsck.
> +
> +.. contents:: Table of Contents
> +   :local:
> +

Something that I've noticed in my training sessions is that often
times, less is more.  People really only absorb so much over a
particular duration of time, so sometimes having too much detail in the
context is not as helpful as you might think.  A lot of times,
paraphrasing excerpts to reflect the same info in a more compact format
will help you keep audience on track (a little longer at least). 

> +1. What is a Filesystem Check?
> +==============================
> +
> +A Unix filesystem has three main jobs: to provide a hierarchy of
> names through
> +which application programs can associate arbitrary blobs of data for
> any
> +length of time, to virtualize physical storage media across those
> names, and
> +to retrieve the named data blobs at any time.
Consider the following paraphrase:

A Unix filesystem has three main jobs:
 * Provide a hierarchy of names by which applications access data for a
length of time.
 * Store or retrieve that data at any time.
 * Virtualize physical storage media across those names

Also... I dont think it would be inappropriate to just skip the above,
and jump right into fsck.  That's a very limited view of a filesystem,
likely a reader seeking an fsck doc probably has some idea of what a fs
is otherwise supposed to be doing.  

> +The filesystem check (fsck) tool examines all the metadata in a
> filesystem
> +to look for errors.
> +Simple tools only check for obvious corruptions, but the more
> sophisticated
> +ones cross-reference metadata records to look for inconsistencies.
> +People do not like losing data, so most fsck tools also contains
> some ability
> +to deal with any problems found.

While simple tools can detect data corruptions, a filesystem check
(fsck) uses metadata records as a cross-reference to find and correct
more inconsistencies.

?

> +As a word of caution -- the primary goal of most Linux fsck tools is
> to restore
> +the filesystem metadata to a consistent state, not to maximize the
> data
> +recovered.
> +That precedent will not be challenged here.
> +
> +Filesystems of the 20th century generally lacked any redundancy in
> the ondisk
> +format, which means that fsck can only respond to errors by erasing
> files until
> +errors are no longer detected.
> +System administrators avoid data loss by increasing the number of
> separate
> +storage systems through the creation of backups; 

> and they avoid downtime by
> +increasing the redundancy of each storage system through the
> creation of RAID.
Mmm, raids help more for hardware failures right?  They dont really
have a notion of when the fs is corrupted.  While an fsck can help
navigate around a corruption possibly caused by a hardware failure, I
think it's really a different kind of redundancy. I think I'd probably
drop the last line and keep the selling point focused online repair.

> +More recent filesystem designs contain enough redundancy in their
> metadata that
> +it is now possible to regenerate data structures when non-
> catastrophic errors
> +occur; 

> this capability aids both strategies.
> +Over the past few years, XFS has added a storage space reverse
> mapping index to
> +make it easy to find which files or metadata objects think they own
> a
> +particular range of storage.
> +Efforts are under way to develop a similar reverse mapping index for
> the naming
> +hierarchy, which will involve storing directory parent pointers in
> each file.
> +With these two pieces in place, XFS uses secondary information to
> perform more
> +sophisticated repairs.
This part here I think I would either let go or relocate.  The topic of
this section is supposed to discuss roughly what a filesystem check is.
Ideally so we can start talking about how ofsck is different.  It feels
like a bit of a jump to suddenly hop into rmap and pptrs, and for
"sophisticated repairs" that we havn't really gotten into the details
of yet.  So I think it would read easier if we saved this part until we
start talking about how they are used later.  

> +
> +TLDR; Show Me the Code!
> +-----------------------
> +
> +Code is posted to the kernel.org git trees as follows:
> +`kernel changes
> <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git
> /log/?h=repair-symlink>`_,
> +`userspace changes
> <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.
> git/log/?h=scrub-media-scan-service>`_, and
> +`QA test changes
> <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.
> git/log/?h=repair-dirs>`_.
> +Each kernel patchset adding an online repair function will use the
> same branch
> +name across the kernel, xfsprogs, and fstests git repos.
> +
> +Existing Tools
> +--------------
> +
> +The online fsck tool described here will be the third tool in the
> history of
> +XFS (on Linux) to check and repair filesystems.
> +Two programs precede it:
> +
> +The first program, ``xfs_check``, was created as part of the XFS
> debugger
> +(``xfs_db``) and can only be used with unmounted filesystems.
> +It walks all metadata in the filesystem looking for inconsistencies
> in the
> +metadata, though it lacks any ability to repair what it finds.
> +Due to its high memory requirements and inability to repair things,
> this
> +program is now deprecated and will not be discussed further.
> +
> +The second program, ``xfs_repair``, was created to be faster and
> more robust
> +than the first program.
> +Like its predecessor, it can only be used with unmounted
> filesystems.
> +It uses extent-based in-memory data structures to reduce memory
> consumption,
> +and tries to schedule readahead IO appropriately to reduce I/O
> waiting time
> +while it scans the metadata of the entire filesystem.
> +The most important feature of this tool is its ability to respond to
> +inconsistencies in file metadata and directory tree by erasing
> things as needed
> +to eliminate problems.
> +Space usage metadata are rebuilt from the observed file metadata.
> +
> +Problem Statement
> +-----------------
> +
> +The current XFS tools leave several problems unsolved:
> +
> +1. **User programs** suddenly **lose access** to information in the
> computer
> +   when unexpected shutdowns occur as a result of silent corruptions
> in the
> +   filesystem metadata.
> +   These occur **unpredictably** and often without warning.

1. **User programs** suddenly **lose access** to the filesystem
   when unexpected shutdowns occur as a result of silent corruptions
that could have otherwise been avoided with an online repair

While some of these issues are not untrue, I think it makes sense to
limit them to the issue you plan to solve, and therefore discuss.

> +
> +2. **Users** experience a **total loss of service** during the
> recovery period
> +   after an **unexpected shutdown** occurs.
> +
> +3. **Users** experience a **total loss of service** if the
> filesystem is taken
> +   offline to **look for problems** proactively.
> +
> +4. **Data owners** cannot **check the integrity** of their stored
> data without
> +   reading all of it.

> +   This may expose them to substantial billing costs when a linear
> media scan
> +   might suffice.
Ok, I had to re-read this one a few times, but I think this reads a
little cleaner:

    Customers that are billed for data egress may incur unnecessary
cost when a background media scan on the host may have sufficed

?

> +
> +5. **System administrators** cannot **schedule** a maintenance
> window to deal
> +   with corruptions if they **lack the means** to assess filesystem
> health
> +   while the filesystem is online.
> +
> +6. **Fleet monitoring tools** cannot **automate periodic checks** of
> filesystem
> +   health when doing so requires **manual intervention** and
> downtime.
> +
> +7. **Users** can be tricked into **doing things they do not desire**
> when
> +   malicious actors **exploit quirks of Unicode** to place
> misleading names
> +   in directories.
hrmm, I guess I'm not immediately extrapolating what things users are
being tricked into doing, or how ofsck solves this?  Otherwise I might
drop the last one here, I think the rest of the bullets are plenty of
motivation.

> +
> +Given this definition of the problems to be solved and the actors
> who would
> +benefit, the proposed solution is a third fsck tool that acts on a
> running
> +filesystem.
> +
> +This new third program has three components: an in-kernel facility
> to check
> +metadata, an in-kernel facility to repair metadata, and a userspace
> driver
> +program to drive fsck activity on a live filesystem.
> +``xfs_scrub`` is the name of the driver program.
> +The rest of this document presents the goals and use cases of the
> new fsck
> +tool, describes its major design points in connection to those
> goals, and
> +discusses the similarities and differences with existing tools.
> +
> ++-------------------------------------------------------------------
> -------+
> +|
> **Note**:                                                            
>     |
> ++-------------------------------------------------------------------
> -------+
> +| Throughout this document, the existing offline fsck tool can also
> be     |
> +| referred to by its current name
> "``xfs_repair``".                        |
> +| The userspace driver program for the new online fsck tool can
> be         |
> +| referred to as
> "``xfs_scrub``".                                          |
> +| The kernel portion of online fsck that validates metadata is
> called      |
> +| "online scrub", and portion of the kernel that fixes metadata is
> called  |
> +| "online
> repair".                                                         |
> ++-------------------------------------------------------------------
> -------+
> 

Hmm, maybe here might be a good spot to move rmap and pptrs?  It's not
otherwise clear to me what "secondary metadata" is.  If that is what it
is meant to refer to, I think the reader will more intuitively make the
connection if those two blurbs appear in the same context.
> +
> +Secondary metadata indices enable the reconstruction of parts of a
> damaged
> +primary metadata object from secondary information.

I would take out this blurb...
> +XFS filesystems shard themselves into multiple primary objects to
> enable better
> +performance on highly threaded systems and to contain the blast
> radius when
> +problems happen.

> +The naming hierarchy is broken up into objects known as directories
> and files;
> +and the physical space is split into pieces known as allocation
> groups.
And add here:

"This enables better performance on highly threaded systems and helps
to contain corruptions when they occur."

I think that reads cleaner

> +The division of the filesystem into principal objects (allocation
> groups and
> +inodes) means that there are ample opportunities to perform targeted
> checks and
> +repairs on a subset of the filesystem.
> +While this is going on, other parts continue processing IO requests.
> +Even if a piece of filesystem metadata can only be regenerated by
> scanning the
> +entire system, the scan can still be done in the background while
> other file
> +operations continue.
> +
> +In summary, online fsck takes advantage of resource sharding and
> redundant
> +metadata to enable targeted checking and repair operations while the
> system
> +is running.
> +This capability will be coupled to automatic system management so
> that
> +autonomous self-healing of XFS maximizes service availability.
> 

Nits and paraphrases aside, I think this looks pretty good?

Allison