From: Darrick J. Wong <djwong@xxxxxxxxxx>

Start the fourth chapter of the online fsck design documentation, which
discusses the user interface and the background scrubbing service.

Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
---
 .../filesystems/xfs-online-fsck-design.rst | 105 ++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 536698b138b8..bdb4bdda3180 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -712,3 +712,108 @@ and the `evolution of existing per-function stress testing
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
 Each kernel patchset adding an online repair function will use the same branch
 name across the kernel, xfsprogs, and fstests git repos.
+
+User Interface
+==============
+
+Like offline fsck, the primary user of online fsck should be the system
+administrator.
+Online fsck presents two modes of operation to administrators:
+A foreground CLI process for online fsck on demand, and a background service
+that performs autonomous checking and repair.
+
+Checking on Demand
+------------------
+
+For administrators who want the absolute freshest information about the
+metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
+a command line.
+The program checks every piece of metadata in the filesystem while the
+administrator waits for the results to be reported, just like the existing
+``xfs_repair`` tool.
+Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
+option to increase the verbosity of the information reported.
+
+A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
+correction capabilities of the hardware to check data file contents.
+The media scan is not enabled by default because it may dramatically increase
+program runtime and consume a lot of bandwidth on older storage hardware.
+
+The output of a foreground invocation will be captured in the system log.
+
+The ``xfs_scrub_all`` program walks the list of mounted filesystems and
+initiates ``xfs_scrub`` for each of them in parallel.
+It serializes scans for any filesystems that resolve to the same top level
+kernel block device to prevent resource overconsumption.
+
+Background Service
+------------------
+
+To reduce the workload of system administrators, the ``xfs_scrub`` package
+provides a suite of `systemd <https://systemd.io/>`_ timers and services that
+run online fsck automatically on weekends.
+The background service configures scrub to run with as little privilege as
+possible (which is quite a lot), at the lowest IO priority, and in
+single-threaded mode, which minimizes the load generated on the system and
+avoids starving regular workloads.
+
+The output of the background service will also be captured in the system log.
+If desired, reports of failures (either due to inconsistencies or mere runtime
+errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
+variable in the following service files:
+
+* ``xfs_scrub_fail@.service``
+* ``xfs_scrub_media_fail@.service``
+* ``xfs_scrub_all_fail.service``
+
+The decision to enable the background scan is left to the system administrator.
+This can be done by enabling one of the following:
+
+* ``xfs_scrub_all.timer`` on systemd systems, to schedule a weekly scan of the
+  metadata of all mounted filesystems.
+* ``xfs_scrub_all.cron`` on non-systemd systems, to schedule a weekly scan of
+  the metadata of all mounted filesystems.
+
+The automatic weekly scan is configured out of the box to perform an additional
+media scan of all file data once per month.
+This is less foolproof than, say, storing file data block checksums, but much
+more performant if application software provides its own integrity checking,
+redundancy can be provided elsewhere above the filesystem, or the storage
+device's integrity guarantees are deemed sufficient.
+
+**Question**: Are we using systemd unit directives to their maximum advantage
+to isolate the scrub process and control its resource usage?
+**Question**: Should we document how system administrators can modify the
+xfs_scrub@ service file to contain the QoS hit?
+Or do we assume admins are familiar with existing systemd documentation?
+Where do we even document that?
+
+Proposed patchsets include
+`enabling the background service
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
+
+Health Reporting
+----------------
+
+XFS caches a summary of each filesystem's health status in memory.
+The information is updated whenever ``xfs_scrub`` is run, as well as whenever
+inconsistencies are detected in the filesystem metadata.
+System administrators can use the ``health`` command of ``xfs_spaceman`` to
+report this information in a human-readable format.
+If problems have been observed, the administrator can decide to schedule a
+reduced service window in which to run the online repair tool to correct the
+problem.
+Failing that, the administrator can decide to schedule a maintenance window to
+run the traditional offline repair tool to correct the problem.
+
+**Question**: Should the health reporting integrate with the new fanotify fs
+error notification system?
+**Question**: Should we write a daemon to listen for corruption notifications
+and initiate a repair?
+
+Proposed patchsets include
+`wiring up health reports to corruption returns
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
+and
+`preservation of sickness info during memory reclaim
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
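
The "Checking on Demand" text says that ``xfs_scrub_all`` scans filesystems in
parallel but serializes scans of filesystems that resolve to the same top level
kernel block device. A minimal sketch of that scheduling policy follows. It is
not the actual ``xfs_scrub_all`` code: the function names, the ``mounts`` input
format, and the ``scrub_one`` callback are all hypothetical, and the sketch
assumes the caller has already resolved each mount point to its top level
device.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_device(mounts):
    """Map each top level block device to the mount points that live on it.

    `mounts` is an iterable of (mountpoint, top_level_device) pairs.
    This input format is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for mountpoint, device in mounts:
        groups[device].append(mountpoint)
    return dict(groups)


def scrub_all(mounts, scrub_one):
    """Run `scrub_one(mountpoint)` for every mount point.

    Mount points that share a device are scrubbed one after another;
    different devices are scrubbed concurrently, one worker per device.
    Returns a dict mapping each mount point to scrub_one's return value.
    """
    def scrub_group(mountpoints):
        # Serial within one device, so scans never compete for the same disk.
        return [(m, scrub_one(m)) for m in mountpoints]

    groups = group_by_device(mounts)
    results = {}
    # ThreadPoolExecutor requires max_workers >= 1, even with no work to do.
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        for pairs in pool.map(scrub_group, groups.values()):
            results.update(pairs)
    return results
```

In a real caller, ``scrub_one`` might launch ``xfs_scrub`` on the given mount
point and return its exit status; here it is left as a callback so that the
grouping logic can be exercised without touching any filesystem.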