On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
>
> Start the first chapter of the online fsck design documentation.
> This covers the motivations for creating this in the first place.
>
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  Documentation/filesystems/index.rst                |    1
>  .../filesystems/xfs-online-fsck-design.rst         |  199 ++++++++++++++++++++
>  2 files changed, 200 insertions(+)
>  create mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst
>
>
> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> index bee63d42e5ec..fbb2b5ada95b 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -123,4 +123,5 @@ Documentation for filesystem implementations.
>     vfat
>     xfs-delayed-logging-design
>     xfs-self-describing-metadata
> +   xfs-online-fsck-design
>     zonefs
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> new file mode 100644
> index 000000000000..25717ebb5f80
> --- /dev/null
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -0,0 +1,199 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. _xfs_online_fsck_design:
> +
> +..
> +        Mapping of heading styles within this document:
> +        Heading 1 uses "====" above and below
> +        Heading 2 uses "===="
> +        Heading 3 uses "----"
> +        Heading 4 uses "````"
> +        Heading 5 uses "^^^^"
> +        Heading 6 uses "~~~~"
> +        Heading 7 uses "...."
> +
> +        Sections are manually numbered because apparently that's what everyone
> +        does in the kernel.
> +
> +======================
> +XFS Online Fsck Design
> +======================
> +
> +This document captures the design of the online filesystem check feature for
> +XFS.
> +The purpose of this document is threefold:
> +
> +- To help kernel distributors understand exactly what the XFS online fsck
> +  feature is, and issues about which they should be aware.
> +
> +- To help people reading the code to familiarize themselves with the relevant
> +  concepts and design points before they start digging into the code.
> +
> +- To help developers maintaining the system by capturing the reasons
> +  supporting higher level decisionmaking.

nit: decision making

> +
> +As the online fsck code is merged, the links in this document to topic branches
> +will be replaced with links to code.
> +
> +This document is licensed under the terms of the GNU Public License, v2.
> +The primary author is Darrick J. Wong.
> +
> +This design document is split into seven parts.
> +Part 1 defines what fsck tools are and the motivations for writing a new one.
> +Parts 2 and 3 present a high level overview of how online fsck process works
> +and how it is tested to ensure correct functionality.
> +Part 4 discusses the user interface and the intended usage modes of the new
> +program.
> +Parts 5 and 6 show off the high level components and how they fit together, and
> +then present case studies of how each repair function actually works.
> +Part 7 sums up what has been discussed so far and speculates about what else
> +might be built atop online fsck.
> +
> +.. contents:: Table of Contents
> +   :local:
> +

Something that I've noticed in my training sessions is that oftentimes,
less is more.  People really only absorb so much over a particular
duration of time, so sometimes having too much detail in the context is
not as helpful as you might think.  A lot of times, paraphrasing
excerpts to reflect the same info in a more compact format will help
you keep the audience on track (a little longer at least).

> +1. What is a Filesystem Check?
> +==============================
> +
> +A Unix filesystem has three main jobs: to provide a hierarchy of names through
> +which application programs can associate arbitrary blobs of data for any
> +length of time, to virtualize physical storage media across those names, and
> +to retrieve the named data blobs at any time.

Consider the following paraphrase:

A Unix filesystem has three main jobs:
 * Provide a hierarchy of names by which applications access data for a
   length of time.
 * Store or retrieve that data at any time.
 * Virtualize physical storage media across those names

Also... I don't think it would be inappropriate to just skip the above
and jump right into fsck.  That's a very limited view of a filesystem,
and a reader seeking an fsck doc probably already has some idea of what
a fs is otherwise supposed to be doing.

> +The filesystem check (fsck) tool examines all the metadata in a filesystem
> +to look for errors.
> +Simple tools only check for obvious corruptions, but the more sophisticated
> +ones cross-reference metadata records to look for inconsistencies.
> +People do not like losing data, so most fsck tools also contains some ability
> +to deal with any problems found.

While simple tools can detect data corruptions, a filesystem check
(fsck) uses metadata records as a cross-reference to find and correct
more inconsistencies.

?

> +As a word of caution -- the primary goal of most Linux fsck tools is to restore
> +the filesystem metadata to a consistent state, not to maximize the data
> +recovered.
> +That precedent will not be challenged here.
> +
> +Filesystems of the 20th century generally lacked any redundancy in the ondisk
> +format, which means that fsck can only respond to errors by erasing files until
> +errors are no longer detected.
> +System administrators avoid data loss by increasing the number of separate
> +storage systems through the creation of backups; and they avoid downtime by
> +increasing the redundancy of each storage system through the creation of RAID.

Mmm, RAIDs help more with hardware failures, right?  They don't really
have a notion of when the fs is corrupted.  While an fsck can help
navigate around a corruption possibly caused by a hardware failure, I
think it's really a different kind of redundancy.  I think I'd probably
drop the last line and keep the selling point focused on online repair.

> +More recent filesystem designs contain enough redundancy in their metadata that
> +it is now possible to regenerate data structures when non-catastrophic errors
> +occur; this capability aids both strategies.
> +Over the past few years, XFS has added a storage space reverse mapping index to
> +make it easy to find which files or metadata objects think they own a
> +particular range of storage.
> +Efforts are under way to develop a similar reverse mapping index for the naming
> +hierarchy, which will involve storing directory parent pointers in each file.
> +With these two pieces in place, XFS uses secondary information to perform more
> +sophisticated repairs.

This part here I think I would either let go or relocate.  The topic of
this section is supposed to be roughly what a filesystem check is,
ideally so we can start talking about how ofsck is different.  It feels
like a bit of a jump to suddenly hop into rmap and pptrs, and into
"sophisticated repairs" that we haven't really gotten into the details
of yet.  So I think it would read easier if we saved this part until we
start talking about how they are used later.

> +
> +TLDR; Show Me the Code!
> +-----------------------
> +
> +Code is posted to the kernel.org git trees as follows:
> +`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
> +`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
> +`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
> +Each kernel patchset adding an online repair function will use the same branch
> +name across the kernel, xfsprogs, and fstests git repos.
> +
> +Existing Tools
> +--------------
> +
> +The online fsck tool described here will be the third tool in the history of
> +XFS (on Linux) to check and repair filesystems.
> +Two programs precede it:
> +
> +The first program, ``xfs_check``, was created as part of the XFS debugger
> +(``xfs_db``) and can only be used with unmounted filesystems.
> +It walks all metadata in the filesystem looking for inconsistencies in the
> +metadata, though it lacks any ability to repair what it finds.
> +Due to its high memory requirements and inability to repair things, this
> +program is now deprecated and will not be discussed further.
> +
> +The second program, ``xfs_repair``, was created to be faster and more robust
> +than the first program.
> +Like its predecessor, it can only be used with unmounted filesystems.
> +It uses extent-based in-memory data structures to reduce memory consumption,
> +and tries to schedule readahead IO appropriately to reduce I/O waiting time
> +while it scans the metadata of the entire filesystem.
> +The most important feature of this tool is its ability to respond to
> +inconsistencies in file metadata and directory tree by erasing things as needed
> +to eliminate problems.
> +Space usage metadata are rebuilt from the observed file metadata.
> +
> +Problem Statement
> +-----------------
> +
> +The current XFS tools leave several problems unsolved:
> +
> +1. **User programs** suddenly **lose access** to information in the computer
> +   when unexpected shutdowns occur as a result of silent corruptions in the
> +   filesystem metadata.
> +   These occur **unpredictably** and often without warning.

1. **User programs** suddenly **lose access** to the filesystem when
   unexpected shutdowns occur as a result of silent corruptions that
   could have otherwise been avoided with an online repair

While some of these issues are not untrue, I think it makes sense to
limit them to the issue you plan to solve, and therefore discuss.

> +
> +2. **Users** experience a **total loss of service** during the recovery period
> +   after an **unexpected shutdown** occurs.
> +
> +3. **Users** experience a **total loss of service** if the filesystem is taken
> +   offline to **look for problems** proactively.
> +
> +4. **Data owners** cannot **check the integrity** of their stored data without
> +   reading all of it.
> +   This may expose them to substantial billing costs when a linear media scan
> +   might suffice.

Ok, I had to re-read this one a few times, but I think this reads a
little cleaner:

    Customers that are billed for data egress may incur unnecessary
    cost when a background media scan on the host may have sufficed

?

> +
> +5. **System administrators** cannot **schedule** a maintenance window to deal
> +   with corruptions if they **lack the means** to assess filesystem health
> +   while the filesystem is online.
> +
> +6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
> +   health when doing so requires **manual intervention** and downtime.
> +
> +7. **Users** can be tricked into **doing things they do not desire** when
> +   malicious actors **exploit quirks of Unicode** to place misleading names
> +   in directories.

hrmm, I guess I'm not immediately extrapolating what things users are
being tricked into doing, or how ofsck solves this?  Otherwise I might
drop the last one here, I think the rest of the bullets are plenty of
motivation.

> +
> +Given this definition of the problems to be solved and the actors who would
> +benefit, the proposed solution is a third fsck tool that acts on a running
> +filesystem.
> +
> +This new third program has three components: an in-kernel facility to check
> +metadata, an in-kernel facility to repair metadata, and a userspace driver
> +program to drive fsck activity on a live filesystem.
> +``xfs_scrub`` is the name of the driver program.
> +The rest of this document presents the goals and use cases of the new fsck
> +tool, describes its major design points in connection to those goals, and
> +discusses the similarities and differences with existing tools.
> +
> ++--------------------------------------------------------------------------+
> +| **Note**:                                                                |
> ++--------------------------------------------------------------------------+
> +| Throughout this document, the existing offline fsck tool can also be     |
> +| referred to by its current name "``xfs_repair``".                        |
> +| The userspace driver program for the new online fsck tool can be         |
> +| referred to as "``xfs_scrub``".                                          |
> +| The kernel portion of online fsck that validates metadata is called      |
> +| "online scrub", and portion of the kernel that fixes metadata is called  |
> +| "online repair".                                                         |
> ++--------------------------------------------------------------------------+
>

Hmm, maybe here might be a good spot to move rmap and pptrs?  It's not
otherwise clear to me what "secondary metadata" is.  If that is what it
is meant to refer to, I think the reader will more intuitively make the
connection if those two blurbs appear in the same context.

> +
> +Secondary metadata indices enable the reconstruction of parts of a damaged
> +primary metadata object from secondary information.

I would take out this blurb...

> +XFS filesystems shard themselves into multiple primary objects to enable better
> +performance on highly threaded systems and to contain the blast radius when
> +problems happen.
> +The naming hierarchy is broken up into objects known as directories and files;
> +and the physical space is split into pieces known as allocation groups.

And add here:

"This enables better performance on highly threaded systems and helps
to contain corruptions when they occur."

I think that reads cleaner

> +The division of the filesystem into principal objects (allocation groups and
> +inodes) means that there are ample opportunities to perform targeted checks and
> +repairs on a subset of the filesystem.
> +While this is going on, other parts continue processing IO requests.
> +Even if a piece of filesystem metadata can only be regenerated by scanning the
> +entire system, the scan can still be done in the background while other file
> +operations continue.
> +
> +In summary, online fsck takes advantage of resource sharding and redundant
> +metadata to enable targeted checking and repair operations while the system
> +is running.
> +This capability will be coupled to automatic system management so that
> +autonomous self-healing of XFS maximizes service availability.
>

Nits and paraphrases aside, I think this looks pretty good?

Allison