Re: [PATCH 01/14] xfs: document the motivation for online fsck design

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 2023-01-11 at 11:10 -0800, Darrick J. Wong wrote:
> On Sat, Jan 07, 2023 at 05:01:54AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Start the first chapter of the online fsck design documentation.
> > > This covers the motivations for creating this in the first place.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > ---
> > >  Documentation/filesystems/index.rst                |    1 
> > >  .../filesystems/xfs-online-fsck-design.rst         |  199
> > > ++++++++++++++++++++
> > >  2 files changed, 200 insertions(+)
> > >  create mode 100644 Documentation/filesystems/xfs-online-fsck-
> > > design.rst
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/index.rst
> > > b/Documentation/filesystems/index.rst
> > > index bee63d42e5ec..fbb2b5ada95b 100644
> > > --- a/Documentation/filesystems/index.rst
> > > +++ b/Documentation/filesystems/index.rst
> > > @@ -123,4 +123,5 @@ Documentation for filesystem implementations.
> > >     vfat
> > >     xfs-delayed-logging-design
> > >     xfs-self-describing-metadata
> > > +   xfs-online-fsck-design
> > >     zonefs
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > new file mode 100644
> > > index 000000000000..25717ebb5f80
> > > --- /dev/null
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -0,0 +1,199 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +.. _xfs_online_fsck_design:
> > > +
> > > +..
> > > +        Mapping of heading styles within this document:
> > > +        Heading 1 uses "====" above and below
> > > +        Heading 2 uses "===="
> > > +        Heading 3 uses "----"
> > > +        Heading 4 uses "````"
> > > +        Heading 5 uses "^^^^"
> > > +        Heading 6 uses "~~~~"
> > > +        Heading 7 uses "...."
> > > +
> > > +        Sections are manually numbered because apparently that's
> > > what everyone
> > > +        does in the kernel.
> > > +
> > > +======================
> > > +XFS Online Fsck Design
> > > +======================
> > > +
> > > +This document captures the design of the online filesystem check
> > > feature for
> > > +XFS.
> > > +The purpose of this document is threefold:
> > > +
> > > +- To help kernel distributors understand exactly what the XFS
> > > online
> > > fsck
> > > +  feature is, and issues about which they should be aware.
> > > +
> > > +- To help people reading the code to familiarize themselves with
> > > the
> > > relevant
> > > +  concepts and design points before they start digging into the
> > > code.
> > > +
> > > +- To help developers maintaining the system by capturing the
> > > reasons
> > > +  supporting higher level decisionmaking.
> > nit: decision making
> 
> Fixed.
> 
> > > +
> > > +As the online fsck code is merged, the links in this document to
> > > topic branches
> > > +will be replaced with links to code.
> > > +
> > > +This document is licensed under the terms of the GNU Public
> > > License,
> > > v2.
> > > +The primary author is Darrick J. Wong.
> > > +
> > > +This design document is split into seven parts.
> > > +Part 1 defines what fsck tools are and the motivations for
> > > writing a
> > > new one.
> > > +Parts 2 and 3 present a high level overview of how online fsck
> > > process works
> > > +and how it is tested to ensure correct functionality.
> > > +Part 4 discusses the user interface and the intended usage modes
> > > of
> > > the new
> > > +program.
> > > +Parts 5 and 6 show off the high level components and how they
> > > fit
> > > together, and
> > > +then present case studies of how each repair function actually
> > > works.
> > > +Part 7 sums up what has been discussed so far and speculates
> > > about
> > > what else
> > > +might be built atop online fsck.
> > > +
> > > +.. contents:: Table of Contents
> > > +   :local:
> > > +
> > 
> > Something that I've noticed in my training sessions is that often
> > times, less is more.  People really only absorb so much over a
> > particular duration of time, so sometimes having too much detail in
> > the
> > context is not as helpful as you might think.  A lot of times,
> > paraphrasing excerpts to reflect the same info in a more compact
> > format
> > will help you keep audience on track (a little longer at least). 
> > 
> > > +1. What is a Filesystem Check?
> > > +==============================
> > > +
> > > +A Unix filesystem has three main jobs: to provide a hierarchy of
> > > names through
> > > +which application programs can associate arbitrary blobs of data
> > > for
> > > any
> > > +length of time, to virtualize physical storage media across
> > > those
> > > names, and
> > > +to retrieve the named data blobs at any time.
> > Consider the following paraphrase:
> > 
> > A Unix filesystem has three main jobs:
> >  * Provide a hierarchy of names by which applications access data
> > for a
> > length of time.
> >  * Store or retrieve that data at any time.
> >  * Virtualize physical storage media across those names
> 
> Ooh, listifying.  I did quite a bit of that to break up the walls of
> text in earlier revisions, but apparently I missed this one.
> 
> > Also... I dont think it would be inappropriate to just skip the
> > above,
> > and jump right into fsck.  That's a very limited view of a
> > filesystem,
> > likely a reader seeking an fsck doc probably has some idea of what
> > a fs
> > is otherwise supposed to be doing.  
> 
> This will become part of the general kernel documentation, so we
> can't
> assume that all readers are going to know what a fs really does.
> 
> "A Unix filesystem has four main responsibilities:
> 
> - Provide a hierarchy of names through which application programs can
>   associate arbitrary blobs of data for any length of time,
> 
> - Virtualize physical storage media across those names, and
> 
> - Retrieve the named data blobs at any time.
> 
> - Examine resource usage.
> 
> "Metadata directly supporting these functions (e.g. files,
> directories,
> space mappings) are sometimes called primary metadata.
> Secondary metadata (e.g. reverse mapping and directory parent
> pointers)
> support operations internal to the filesystem, such as internal
> consistency checking and reorganization."
Sure, I think that sounds good and helps to set up the metadata
concepts that are discussed later.
> 
> (I added those last two sentences in response to a point you made
> below.)
> 
> > > +The filesystem check (fsck) tool examines all the metadata in a
> > > filesystem
> > > +to look for errors.
> > > +Simple tools only check for obvious corruptions, but the more
> > > sophisticated
> > > +ones cross-reference metadata records to look for
> > > inconsistencies.
> > > +People do not like losing data, so most fsck tools also contains
> > > some ability
> > > +to deal with any problems found.
> > 
> > While simple tools can detect data corruptions, a filesystem check
> > (fsck) uses metadata records as a cross-reference to find and
> > correct
> > more inconsistencies.
> > 
> > ?
> 
> Let's be careful with the term 'data corruption' here -- a lot of
> people
> (well ok me) will see that as *user* data corruption, whereas we're
> talking about *metadata* corruption.
> 
> I think I'll rework that second sentence further:
> 
> "In addition to looking for obvious metadata corruptions, fsck also
> cross-references different types of metadata records with each other
> to
> look for inconsistencies."
> 
Alrighty, that sounds good

> Since the really dumb fscks of the 1970s are a long ways past now.
> 
> > > +As a word of caution -- the primary goal of most Linux fsck
> > > tools is
> > > to restore
> > > +the filesystem metadata to a consistent state, not to maximize
> > > the
> > > data
> > > +recovered.
> > > +That precedent will not be challenged here.
> > > +
> > > +Filesystems of the 20th century generally lacked any redundancy
> > > in
> > > the ondisk
> > > +format, which means that fsck can only respond to errors by
> > > erasing
> > > files until
> > > +errors are no longer detected.
> > > +System administrators avoid data loss by increasing the number
> > > of
> > > separate
> > > +storage systems through the creation of backups; 
> > 
> > 
> > > and they avoid downtime by
> > > +increasing the redundancy of each storage system through the
> > > creation of RAID.
> > Mmm, raids help more for hardware failures right?  They dont really
> > have a notion of when the fs is corrupted.
> 
> Right.
> 
> > While an fsck can help
> > navigate around a corruption possibly caused by a hardware failure,
> > I
> > think it's really a different kind of redundancy. I think I'd
> > probably
> > drop the last line and keep the selling point focused online
> > repair.
> 
> Yes, RAIDs provide a totally different type of redundancy.  I decided
> to
> make this point specifically to counter the people who argue that
> RAID
> makes them impervious to corruption problems, etc.
> 
> This attitude seemed rather prevalent in the early days of btrfs and
> a
> certain other filesystem that Shall Not Be Named, even though the
> btrfs
> developers themselves acknowledge this distinction, given the
> existence
> of `btrfs scrub' and `btrfs check'.
> 
> However you do have a good point that this sentence doesn't add much
> where it is.  I think I'll add it as a sidebar at the end of the
> paragraph.
> 
> > > +More recent filesystem designs contain enough redundancy in
> > > their
> > > metadata that
> > > +it is now possible to regenerate data structures when non-
> > > catastrophic errors
> > > +occur; 
> > 
> > 
> > > this capability aids both strategies.
> > > +Over the past few years, XFS has added a storage space reverse
> > > mapping index to
> > > +make it easy to find which files or metadata objects think they
> > > own
> > > a
> > > +particular range of storage.
> > > +Efforts are under way to develop a similar reverse mapping index
> > > for
> > > the naming
> > > +hierarchy, which will involve storing directory parent pointers
> > > in
> > > each file.
> > > +With these two pieces in place, XFS uses secondary information
> > > to
> > > perform more
> > > +sophisticated repairs.
> > This part here I think I would either let go or relocate.  The
> > topic of
> > this section is supposed to discuss roughly what a filesystem check
> > is.
> > Ideally so we can start talking about how ofsck is different.  It
> > feels
> > like a bit of a jump to suddenly hop into rmap and pptrs, and for
> > "sophisticated repairs" that we havn't really gotten into the
> > details
> > of yet.  So I think it would read easier if we saved this part
> > until we
> > start talking about how they are used later.  
> 
> Agreed.
> 
> > > +
> > > +TLDR; Show Me the Code!
> > > +-----------------------
> > > +
> > > +Code is posted to the kernel.org git trees as follows:
> > > +`kernel changes
> > > <
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.g
> > > it
> > > /log/?h=repair-symlink>`_,
> > > +`userspace changes
> > > <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-
> > > dev.
> > > git/log/?h=scrub-media-scan-service>`_, and
> > > +`QA test changes
> > > <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-
> > > dev.
> > > git/log/?h=repair-dirs>`_.
> > > +Each kernel patchset adding an online repair function will use
> > > the
> > > same branch
> > > +name across the kernel, xfsprogs, and fstests git repos.
> > > +
> > > +Existing Tools
> > > +--------------
> > > +
> > > +The online fsck tool described here will be the third tool in
> > > the
> > > history of
> > > +XFS (on Linux) to check and repair filesystems.
> > > +Two programs precede it:
> > > +
> > > +The first program, ``xfs_check``, was created as part of the XFS
> > > debugger
> > > +(``xfs_db``) and can only be used with unmounted filesystems.
> > > +It walks all metadata in the filesystem looking for
> > > inconsistencies
> > > in the
> > > +metadata, though it lacks any ability to repair what it finds.
> > > +Due to its high memory requirements and inability to repair
> > > things,
> > > this
> > > +program is now deprecated and will not be discussed further.
> > > +
> > > +The second program, ``xfs_repair``, was created to be faster and
> > > more robust
> > > +than the first program.
> > > +Like its predecessor, it can only be used with unmounted
> > > filesystems.
> > > +It uses extent-based in-memory data structures to reduce memory
> > > consumption,
> > > +and tries to schedule readahead IO appropriately to reduce I/O
> > > waiting time
> > > +while it scans the metadata of the entire filesystem.
> > > +The most important feature of this tool is its ability to
> > > respond to
> > > +inconsistencies in file metadata and directory tree by erasing
> > > things as needed
> > > +to eliminate problems.
> > > +Space usage metadata are rebuilt from the observed file
> > > metadata.
> > > +
> > > +Problem Statement
> > > +-----------------
> > > +
> > > +The current XFS tools leave several problems unsolved:
> > > +
> > > +1. **User programs** suddenly **lose access** to information in
> > > the
> > > computer
> > > +   when unexpected shutdowns occur as a result of silent
> > > corruptions
> > > in the
> > > +   filesystem metadata.
> > > +   These occur **unpredictably** and often without warning.
> > 
> > 
> > 1. **User programs** suddenly **lose access** to the filesystem
> >    when unexpected shutdowns occur as a result of silent
> > corruptions
> > that could have otherwise been avoided with an online repair
> > 
> > While some of these issues are not untrue, I think it makes sense
> > to
> > limit them to the issue you plan to solve, and therefore discuss.
> 
> Fair enough, it's not like one loses /all/ the data in the computer.
> 
> That said, we're still in the problem definition phase, so I don't
> want
> to mention online repair just yet.
> 
> > > +2. **Users** experience a **total loss of service** during the
> > > recovery period
> > > +   after an **unexpected shutdown** occurs.
> > > +
> > > +3. **Users** experience a **total loss of service** if the
> > > filesystem is taken
> > > +   offline to **look for problems** proactively.
> > > +
> > > +4. **Data owners** cannot **check the integrity** of their
> > > stored
> > > data without
> > > +   reading all of it.
> > 
> > > +   This may expose them to substantial billing costs when a
> > > linear
> > > media scan
> > > +   might suffice.
> > Ok, I had to re-read this one a few times, but I think this reads a
> > little cleaner:
> > 
> >     Customers that are billed for data egress may incur unnecessary
> > cost when a background media scan on the host may have sufficed
> > 
> > ?
> 
> "...when a linear media scan performed by the storage system
> administrator would suffice."
> 
That sounds fine to me

> I was tempted to say "storage owner" instead of "storage system
> administrator" but that sounded a little too IBM.
> 
> > > +5. **System administrators** cannot **schedule** a maintenance
> > > window to deal
> > > +   with corruptions if they **lack the means** to assess
> > > filesystem
> > > health
> > > +   while the filesystem is online.
> > > +
> > > +6. **Fleet monitoring tools** cannot **automate periodic
> > > checks** of
> > > filesystem
> > > +   health when doing so requires **manual intervention** and
> > > downtime.
> > > +
> > > +7. **Users** can be tricked into **doing things they do not
> > > desire**
> > > when
> > > +   malicious actors **exploit quirks of Unicode** to place
> > > misleading names
> > > +   in directories.
> > hrmm, I guess I'm not immediately extrapolating what things users
> > are
> > being tricked into doing, or how ofsck solves this?  Otherwise I
> > might
> > drop the last one here, I think the rest of the bullets are plenty
> > of
> > motivation.
> 
> The doc gets into this later[1], but it's possible to create two
> entries
> within the same directory that have different byte sequences in the
> name
> but render identically in file choosers.  These pathnames:
> 
> /home/djwong/Downloads/rustup.sh
> /home/djwong/Downloads/rus<zero width space>tup.sh
> 
> refer to different files, but a naïve file open dialog will render
> them
> identically as "rustup.sh".  If the first is the Rust installer and
> the
> second name is actually a ransomware payload, I can victimize you by
> tricking you into opening the wrong one.
> 
> Firefox had a whole CVE over this in 2018:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1438025
> 
> xfs_scrub is (so far) the only linux filesystem fsck tool that will
> warn
> system administrators about this kind of thing.
> 
> See generic/453 and generic/454.
> 
> [1] https://djwong.org/docs/xfs-online-fsck-design/#id108
> 
hmm ok, how about:

7. Malicious attacks may use uncommon unicode characters to create file
names that resemble normal files, which may go undetected until the
filesystem is scanned.


?

> > > +
> > > +Given this definition of the problems to be solved and the
> > > actors
> > > who would
> > > +benefit, the proposed solution is a third fsck tool that acts on
> > > a
> > > running
> > > +filesystem.
> > > +
> > > +This new third program has three components: an in-kernel
> > > facility
> > > to check
> > > +metadata, an in-kernel facility to repair metadata, and a
> > > userspace
> > > driver
> > > +program to drive fsck activity on a live filesystem.
> > > +``xfs_scrub`` is the name of the driver program.
> > > +The rest of this document presents the goals and use cases of
> > > the
> > > new fsck
> > > +tool, describes its major design points in connection to those
> > > goals, and
> > > +discusses the similarities and differences with existing tools.
> > > +
> > > ++---------------------------------------------------------------
> > > ----
> > > -------+
> > > +|
> > > **Note**:                                                        
> > >     
> > >     |
> > > ++---------------------------------------------------------------
> > > ----
> > > -------+
> > > +| Throughout this document, the existing offline fsck tool can
> > > also
> > > be     |
> > > +| referred to by its current name
> > > "``xfs_repair``".                        |
> > > +| The userspace driver program for the new online fsck tool can
> > > be         |
> > > +| referred to as
> > > "``xfs_scrub``".                                          |
> > > +| The kernel portion of online fsck that validates metadata is
> > > called      |
> > > +| "online scrub", and portion of the kernel that fixes metadata
> > > is
> > > called  |
> > > +| "online
> > > repair".                                                        
> > > |
> > > ++---------------------------------------------------------------
> > > ----
> > > -------+
> 
> Errr ^^^^ is Evolution doing line wrapping here?
> 
> > Hmm, maybe here might be a good spot to move rmap and pptrs?  It's
> > not
> > otherwise clear to me what "secondary metadata" is.  If that is
> > what it
> > is meant to refer to, I think the reader will more intuitively make
> > the
> > connection if those two blurbs appear in the same context.
> 
> Ooh, you found a significant gap-- nowhere in this chapter do I
> actually
> define what is primary metadata.  Or secondary metadata.
> 
> > > +
> > > +Secondary metadata indices enable the reconstruction of parts of
> > > a
> > > damaged
> > > +primary metadata object from secondary information.
> > 
> > I would take out this blurb...
> > > +XFS filesystems shard themselves into multiple primary objects
> > > to
> > > enable better
> > > +performance on highly threaded systems and to contain the blast
> > > radius when
> > > +problems happen.
> > 
> > 
> > > +The naming hierarchy is broken up into objects known as
> > > directories
> > > and files;
> > > +and the physical space is split into pieces known as allocation
> > > groups.
> > And add here:
> > 
> > "This enables better performance on highly threaded systems and
> > helps
> > to contain corruptions when they occur."
> > 
> > I think that reads cleaner
> 
> Ok.  Mind if I reword this slightly?  The entire paragraph now reads
> like this:
> 
> "The naming hierarchy is broken up into objects known as directories
> and
> files and the physical space is split into pieces known as allocation
> groups.  Sharding enables better performance on highly parallel
> systems
> and helps to contain the damage when corruptions occur.  The division
> of
> the filesystem into principal objects (allocation groups and inodes)
> means that there are ample opportunities to perform targeted checks
> and
> repairs on a subset of the filesystem."
I think that sounds cleaner

> 
> > > +The division of the filesystem into principal objects
> > > (allocation
> > > groups and
> > > +inodes) means that there are ample opportunities to perform
> > > targeted
> > > checks and
> > > +repairs on a subset of the filesystem.
> > > +While this is going on, other parts continue processing IO
> > > requests.
> > > +Even if a piece of filesystem metadata can only be regenerated
> > > by
> > > scanning the
> > > +entire system, the scan can still be done in the background
> > > while
> > > other file
> > > +operations continue.
> > > +
> > > +In summary, online fsck takes advantage of resource sharding and
> > > redundant
> > > +metadata to enable targeted checking and repair operations while
> > > the
> > > system
> > > +is running.
> > > +This capability will be coupled to automatic system management
> > > so
> > > that
> > > +autonomous self-healing of XFS maximizes service availability.
> > > 
> > 
> > Nits and paraphrases aside, I think this looks pretty good?
> 
> Woot.  Thanks for digging in! :)
> 
Sure, no problem!

> > Allison
> > 





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux