[PATCH 05/24] docs: add XFS self-describing metadata integrity doc to DS&A book

From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>

Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
---
 .../filesystems/xfs-self-describing-metadata.txt   |  350 -----------------
 Documentation/filesystems/xfs/ondisk/overview.rst  |    2 
 .../xfs/ondisk/self_describing_metadata.rst        |  402 ++++++++++++++++++++
 3 files changed, 404 insertions(+), 350 deletions(-)
 delete mode 100644 Documentation/filesystems/xfs-self-describing-metadata.txt
 create mode 100644 Documentation/filesystems/xfs/ondisk/self_describing_metadata.rst


diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt
deleted file mode 100644
index 05aa455163e3..000000000000
--- a/Documentation/filesystems/xfs-self-describing-metadata.txt
+++ /dev/null
@@ -1,350 +0,0 @@
-XFS Self Describing Metadata
-----------------------------
-
-Introduction
-------------
-
-The largest scalability problem facing XFS is not one of algorithmic
-scalability, but of verification of the filesystem structure. Scalabilty of the
-structures and indexes on disk and the algorithms for iterating them are
-adequate for supporting PB scale filesystems with billions of inodes, however it
-is this very scalability that causes the verification problem.
-
-Almost all metadata on XFS is dynamically allocated. The only fixed location
-metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
-other metadata structures need to be discovered by walking the filesystem
-structure in different ways. While this is already done by userspace tools for
-validating and repairing the structure, there are limits to what they can
-verify, and this in turn limits the supportable size of an XFS filesystem.
-
-For example, it is entirely possible to manually use xfs_db and a bit of
-scripting to analyse the structure of a 100TB filesystem when trying to
-determine the root cause of a corruption problem, but it is still mainly a
-manual task of verifying that things like single bit errors or misplaced writes
-weren't the ultimate cause of a corruption event. It may take a few hours to a
-few days to perform such forensic analysis, so for at this scale root cause
-analysis is entirely possible.
-
-However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
-to analyse and so that analysis blows out towards weeks/months of forensic work.
-Most of the analysis work is slow and tedious, so as the amount of analysis goes
-up, the more likely that the cause will be lost in the noise.  Hence the primary
-concern for supporting PB scale filesystems is minimising the time and effort
-required for basic forensic analysis of the filesystem structure.
-
-
-Self Describing Metadata
-------------------------
-
-One of the problems with the current metadata format is that apart from the
-magic number in the metadata block, we have no other way of identifying what it
-is supposed to be. We can't even identify if it is the right place. Put simply,
-you can't look at a single metadata block in isolation and say "yes, it is
-supposed to be there and the contents are valid".
-
-Hence most of the time spent on forensic analysis is spent doing basic
-verification of metadata values, looking for values that are in range (and hence
-not detected by automated verification checks) but are not correct. Finding and
-understanding how things like cross linked block lists (e.g. sibling
-pointers in a btree end up with loops in them) are the key to understanding what
-went wrong, but it is impossible to tell what order the blocks were linked into
-each other or written to disk after the fact.
-
-Hence we need to record more information into the metadata to allow us to
-quickly determine if the metadata is intact and can be ignored for the purpose
-of analysis. We can't protect against every possible type of error, but we can
-ensure that common types of errors are easily detectable.  Hence the concept of
-self describing metadata.
-
-The first, fundamental requirement of self describing metadata is that the
-metadata object contains some form of unique identifier in a well known
-location. This allows us to identify the expected contents of the block and
-hence parse and verify the metadata object. IF we can't independently identify
-the type of metadata in the object, then the metadata doesn't describe itself
-very well at all!
-
-Luckily, almost all XFS metadata has magic numbers embedded already - only the
-AGFL, remote symlinks and remote attribute blocks do not contain identifying
-magic numbers. Hence we can change the on-disk format of all these objects to
-add more identifying information and detect this simply by changing the magic
-numbers in the metadata objects. That is, if it has the current magic number,
-the metadata isn't self identifying. If it contains a new magic number, it is
-self identifying and we can do much more expansive automated verification of the
-metadata object at runtime, during forensic analysis or repair.
-
-As a primary concern, self describing metadata needs some form of overall
-integrity checking. We cannot trust the metadata if we cannot verify that it has
-not been changed as a result of external influences. Hence we need some form of
-integrity check, and this is done by adding CRC32c validation to the metadata
-block. If we can verify the block contains the metadata it was intended to
-contain, a large amount of the manual verification work can be skipped.
-
-CRC32c was selected as metadata cannot be more than 64k in length in XFS and
-hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
-metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
-fast. So while CRC32c is not the strongest of possible integrity checks that
-could be used, it is more than sufficient for our needs and has relatively
-little overhead. Adding support for larger integrity fields and/or algorithms
-does really provide any extra value over CRC32c, but it does add a lot of
-complexity and so there is no provision for changing the integrity checking
-mechanism.
-
-Self describing metadata needs to contain enough information so that the
-metadata block can be verified as being in the correct place without needing to
-look at any other metadata. This means it needs to contain location information.
-Just adding a block number to the metadata is not sufficient to protect against
-mis-directed writes - a write might be misdirected to the wrong LUN and so be
-written to the "correct block" of the wrong filesystem. Hence location
-information must contain a filesystem identifier as well as a block number.
-
-Another key information point in forensic analysis is knowing who the metadata
-block belongs to. We already know the type, the location, that it is valid
-and/or corrupted, and how long ago that it was last modified. Knowing the owner
-of the block is important as it allows us to find other related metadata to
-determine the scope of the corruption. For example, if we have a extent btree
-object, we don't know what inode it belongs to and hence have to walk the entire
-filesystem to find the owner of the block. Worse, the corruption could mean that
-no owner can be found (i.e. it's an orphan block), and so without an owner field
-in the metadata we have no idea of the scope of the corruption. If we have an
-owner field in the metadata object, we can immediately do top down validation to
-determine the scope of the problem.
-
-Different types of metadata have different owner identifiers. For example,
-directory, attribute and extent tree blocks are all owned by an inode, whilst
-freespace btree blocks are owned by an allocation group. Hence the size and
-contents of the owner field are determined by the type of metadata object we are
-looking at.  The owner information can also identify misplaced writes (e.g.
-freespace btree block written to the wrong AG).
-
-Self describing metadata also needs to contain some indication of when it was
-written to the filesystem. One of the key information points when doing forensic
-analysis is how recently the block was modified. Correlation of set of corrupted
-metadata blocks based on modification times is important as it can indicate
-whether the corruptions are related, whether there's been multiple corruption
-events that lead to the eventual failure, and even whether there are corruptions
-present that the run-time verification is not detecting.
-
-For example, we can determine whether a metadata object is supposed to be free
-space or still allocated if it is still referenced by its owner by looking at
-when the free space btree block that contains the block was last written
-compared to when the metadata object itself was last written.  If the free space
-block is more recent than the object and the object's owner, then there is a
-very good chance that the block should have been removed from the owner.
-
-To provide this "written timestamp", each metadata block gets the Log Sequence
-Number (LSN) of the most recent transaction it was modified on written into it.
-This number will always increase over the life of the filesystem, and the only
-thing that resets it is running xfs_repair on the filesystem. Further, by use of
-the LSN we can tell if the corrupted metadata all belonged to the same log
-checkpoint and hence have some idea of how much modification occurred between
-the first and last instance of corrupt metadata on disk and, further, how much
-modification occurred between the corruption being written and when it was
-detected.
-
-Runtime Validation
-------------------
-
-Validation of self-describing metadata takes place at runtime in two places:
-
-	- immediately after a successful read from disk
-	- immediately prior to write IO submission
-
-The verification is completely stateless - it is done independently of the
-modification process, and seeks only to check that the metadata is what it says
-it is and that the metadata fields are within bounds and internally consistent.
-As such, we cannot catch all types of corruption that can occur within a block
-as there may be certain limitations that operational state enforces of the
-metadata, or there may be corruption of interblock relationships (e.g. corrupted
-sibling pointer lists). Hence we still need stateful checking in the main code
-body, but in general most of the per-field validation is handled by the
-verifiers.
-
-For read verification, the caller needs to specify the expected type of metadata
-that it should see, and the IO completion process verifies that the metadata
-object matches what was expected. If the verification process fails, then it
-marks the object being read as EFSCORRUPTED. The caller needs to catch this
-error (same as for IO errors), and if it needs to take special action due to a
-verification error it can do so by catching the EFSCORRUPTED error value. If we
-need more discrimination of error type at higher levels, we can define new
-error numbers for different errors as necessary.
-
-The first step in read verification is checking the magic number and determining
-whether CRC validating is necessary. If it is, the CRC32c is calculated and
-compared against the value stored in the object itself. Once this is validated,
-further checks are made against the location information, followed by extensive
-object specific metadata validation. If any of these checks fail, then the
-buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
-
-Write verification is the opposite of the read verification - first the object
-is extensively verified and if it is OK we then update the LSN from the last
-modification made to the object, After this, we calculate the CRC and insert it
-into the object. Once this is done the write IO is allowed to continue. If any
-error occurs during this process, the buffer is again marked with a EFSCORRUPTED
-error for the higher layers to catch.
-
-Structures
-----------
-
-A typical on-disk structure needs to contain the following information:
-
-struct xfs_ondisk_hdr {
-        __be32  magic;		/* magic number */
-        __be32  crc;		/* CRC, not logged */
-        uuid_t  uuid;		/* filesystem identifier */
-        __be64  owner;		/* parent object */
-        __be64  blkno;		/* location on disk */
-        __be64  lsn;		/* last modification in log, not logged */
-};
-
-Depending on the metadata, this information may be part of a header structure
-separate to the metadata contents, or may be distributed through an existing
-structure. The latter occurs with metadata that already contains some of this
-information, such as the superblock and AG headers.
-
-Other metadata may have different formats for the information, but the same
-level of information is generally provided. For example:
-
-	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
-	  number for location. The two of these combined provide the same
-	  information as @owner and @blkno in eh above structure, but using 8
-	  bytes less space on disk.
-
-	- directory/attribute node blocks have a 16 bit magic number, and the
-	  header that contains the magic number has other information in it as
-	  well. hence the additional metadata headers change the overall format
-	  of the metadata.
-
-A typical buffer read verifier is structured as follows:
-
-#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)
-
-static void
-xfs_foo_read_verify(
-	struct xfs_buf	*bp)
-{
-       struct xfs_mount *mp = bp->b_target->bt_mount;
-
-        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
-             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
-					XFS_FOO_CRC_OFF)) ||
-            !xfs_foo_verify(bp)) {
-                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
-                xfs_buf_ioerror(bp, EFSCORRUPTED);
-        }
-}
-
-The code ensures that the CRC is only checked if the filesystem has CRCs enabled
-by checking the superblock of the feature bit, and then if the CRC verifies OK
-(or is not needed) it verifies the actual contents of the block.
-
-The verifier function will take a couple of different forms, depending on
-whether the magic number can be used to determine the format of the block. In
-the case it can't, the code is structured as follows:
-
-static bool
-xfs_foo_verify(
-	struct xfs_buf		*bp)
-{
-        struct xfs_mount	*mp = bp->b_target->bt_mount;
-        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
-
-        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
-                return false;
-
-        if (!xfs_sb_version_hascrc(&mp->m_sb)) {
-		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
-			return false;
-		if (bp->b_bn != be64_to_cpu(hdr->blkno))
-			return false;
-		if (hdr->owner == 0)
-			return false;
-	}
-
-	/* object specific verification checks here */
-
-        return true;
-}
-
-If there are different magic numbers for the different formats, the verifier
-will look like:
-
-static bool
-xfs_foo_verify(
-	struct xfs_buf		*bp)
-{
-        struct xfs_mount	*mp = bp->b_target->bt_mount;
-        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
-
-        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
-		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
-			return false;
-		if (bp->b_bn != be64_to_cpu(hdr->blkno))
-			return false;
-		if (hdr->owner == 0)
-			return false;
-	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
-		return false;
-
-	/* object specific verification checks here */
-
-        return true;
-}
-
-Write verifiers are very similar to the read verifiers, they just do things in
-the opposite order to the read verifiers. A typical write verifier:
-
-static void
-xfs_foo_write_verify(
-	struct xfs_buf	*bp)
-{
-	struct xfs_mount	*mp = bp->b_target->bt_mount;
-	struct xfs_buf_log_item	*bip = bp->b_fspriv;
-
-	if (!xfs_foo_verify(bp)) {
-		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
-		xfs_buf_ioerror(bp, EFSCORRUPTED);
-		return;
-	}
-
-	if (!xfs_sb_version_hascrc(&mp->m_sb))
-		return;
-
-
-	if (bip) {
-		struct xfs_ondisk_hdr	*hdr = bp->b_addr;
-		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
-	}
-	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
-}
-
-This will verify the internal structure of the metadata before we go any
-further, detecting corruptions that have occurred as the metadata has been
-modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
-update the LSN field (when it was last modified) and calculate the CRC on the
-metadata. Once this is done, we can issue the IO.
-
-Inodes and Dquots
------------------
-
-Inodes and dquots are special snowflakes. They have per-object CRC and
-self-identifiers, but they are packed so that there are multiple objects per
-buffer. Hence we do not use per-buffer verifiers to do the work of per-object
-verification and CRC calculations. The per-buffer verifiers simply perform basic
-identification of the buffer - that they contain inodes or dquots, and that
-there are magic numbers in all the expected spots. All further CRC and
-verification checks are done when each inode is read from or written back to the
-buffer.
-
-The structure of the verifiers and the identifiers checks is very similar to the
-buffer code described above. The only difference is where they are called. For
-example, inode read verification is done in xfs_iread() when the inode is first
-read out of the buffer and the struct xfs_inode is instantiated. The inode is
-already extensively verified during writeback in xfs_iflush_int, so the only
-addition here is to add the LSN and CRC to the inode as it is copied back into
-the buffer.
-
-XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
-the unlinked list modifications check or update CRCs, neither during unlink nor
-log recovery. So, it's gone unnoticed until now. This won't matter immediately -
-repair will probably complain about it - but it needs to be fixed.
-
diff --git a/Documentation/filesystems/xfs/ondisk/overview.rst b/Documentation/filesystems/xfs/ondisk/overview.rst
index ea745e181318..c96a65572e7b 100644
--- a/Documentation/filesystems/xfs/ondisk/overview.rst
+++ b/Documentation/filesystems/xfs/ondisk/overview.rst
@@ -42,3 +42,5 @@ filesystem operations can be carried out atomically in the case of a crash.
 Furthermore, there is the concept of a real-time device wherein allocations
 are tracked more simply and in larger chunks to reduce jitter in allocation
 latency.
+
+.. include:: self_describing_metadata.rst
diff --git a/Documentation/filesystems/xfs/ondisk/self_describing_metadata.rst b/Documentation/filesystems/xfs/ondisk/self_describing_metadata.rst
new file mode 100644
index 000000000000..2ce5a42deaf9
--- /dev/null
+++ b/Documentation/filesystems/xfs/ondisk/self_describing_metadata.rst
@@ -0,0 +1,402 @@
+.. SPDX-License-Identifier: CC-BY-SA-3.0+
+
+Metadata Integrity
+------------------
+
+Introduction
+~~~~~~~~~~~~
+
+The largest scalability problem facing XFS is not one of algorithmic
+scalability, but of verification of the filesystem structure. Scalability of
+the structures and indexes on disk and the algorithms for iterating them are
+adequate for supporting PB scale filesystems with billions of inodes; however,
+it is this very scalability that causes the verification problem.
+
+Almost all metadata on XFS is dynamically allocated. The only fixed location
+metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
+other metadata structures need to be discovered by walking the filesystem
+structure in different ways. While this is already done by userspace tools for
+validating and repairing the structure, there are limits to what they can
+verify, and this in turn limits the supportable size of an XFS filesystem.
+
+For example, it is entirely possible to manually use xfs\_db and a bit of
+scripting to analyse the structure of a 100TB filesystem when trying to
+determine the root cause of a corruption problem, but it is still mainly a
+manual task of verifying that things like single bit errors or misplaced
+writes weren’t the ultimate cause of a corruption event. It may take a few
+hours to a few days to perform such forensic analysis, so at this scale
+root cause analysis is entirely possible.
+
+However, if we scale the filesystem up to 1PB, we now have 10x as much
+metadata to analyse and so that analysis blows out towards weeks/months of
+forensic work. Most of the analysis work is slow and tedious, so as the amount
+of analysis goes up, it becomes more likely that the cause will be lost in the
+noise. Hence the primary concern for supporting PB scale filesystems is
+minimising the time and effort required for basic forensic analysis of the
+filesystem structure.
+
+Therefore, the version 5 disk format introduced larger headers for all
+metadata types, which enable the filesystem to check information being read
+from the disk more rigorously. Metadata integrity fields now include:
+
+-  **Magic** numbers, to classify all types of metadata. This is unchanged
+   from v4.
+
+-  A copy of the filesystem **UUID**, to confirm that a given disk block is
+   connected to the superblock.
+
+-  The **owner**, to avoid accessing a piece of metadata which belongs to some
+   other part of the filesystem.
+
+-  The filesystem **block number**, to detect misplaced writes.
+
+-  The **log sequence number** of the last write to this block, to avoid
+   replaying obsolete log entries.
+
+-  A CRC32c **checksum** of the entire block, to detect minor corruption.
+
+Metadata integrity coverage has been extended to all metadata blocks in the
+filesystem, with the following notes:
+
+-  Inodes can have multiple "owners" in the directory tree; therefore the
+   record contains the inode number instead of an owner or a block number.
+
+-  Superblocks have no owners.
+
+-  The disk quota file has no owner or block numbers.
+
+-  Metadata owned by files list the inode number as the owner.
+
+-  Per-AG data and B+tree blocks list the AG number as the owner.
+
+-  Per-AG header sectors don’t list owners or block numbers, since they have
+   fixed locations.
+
+-  Remote attribute blocks are not logged and therefore the LSN must be -1.
+
+This functionality enables XFS to decide that a block's contents are so
+unexpected that it should stop immediately. Unfortunately checksums do not
+allow for automatic correction. Please keep regular backups, as always.
+
+Self Describing Metadata
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+One of the problems with the current metadata format is that apart from the
+magic number in the metadata block, we have no other way of identifying what
+it is supposed to be. We can’t even identify if it is in the right place. Put
+simply, you can’t look at a single metadata block in isolation and say "yes,
+it is supposed to be there and the contents are valid".
+
+Hence most of the time spent on forensic analysis is spent doing basic
+verification of metadata values, looking for values that are in range (and
+hence not detected by automated verification checks) but are not correct.
+Finding and understanding how things like cross linked block lists arise (e.g.
+sibling pointers in a btree that end up with loops in them) is the key to
+understanding what went wrong, but it is impossible to tell what order the
+blocks were linked into each other or written to disk after the fact.
+
+Hence we need to record more information into the metadata to allow us to
+quickly determine if the metadata is intact and can be ignored for the purpose
+of analysis. We can’t protect against every possible type of error, but we can
+ensure that common types of errors are easily detectable. Hence the concept of
+self describing metadata.
+
+The first, fundamental requirement of self describing metadata is that the
+metadata object contains some form of unique identifier in a well known
+location. This allows us to identify the expected contents of the block and
+hence parse and verify the metadata object. If we can’t independently identify
+the type of metadata in the object, then the metadata doesn’t describe itself
+very well at all!
+
+Luckily, almost all XFS metadata has magic numbers embedded already - only the
+AGFL, remote symlinks and remote attribute blocks do not contain identifying
+magic numbers. Hence we can change the on-disk format of all these objects to
+add more identifying information and detect this simply by changing the magic
+numbers in the metadata objects. That is, if it has the current magic number,
+the metadata isn’t self identifying. If it contains a new magic number, it is
+self identifying and we can do much more expansive automated verification of
+the metadata object at runtime, during forensic analysis or repair.
+
+As a primary concern, self describing metadata needs some form of overall
+integrity checking. We cannot trust the metadata if we cannot verify that it
+has not been changed as a result of external influences. Hence we need some
+form of integrity check, and this is done by adding CRC32c validation to the
+metadata block. If we can verify the block contains the metadata it was
+intended to contain, a large amount of the manual verification work can be
+skipped.
+
+CRC32c was selected as metadata cannot be more than 64k in length in XFS and
+hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
+metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it
+is fast. So while CRC32c is not the strongest of possible integrity checks
+that could be used, it is more than sufficient for our needs and has
+relatively little overhead. Adding support for larger integrity fields and/or
+algorithms does not really provide any extra value over CRC32c, but it does
+add a lot of complexity and so there is no provision for changing the
+integrity checking mechanism.
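+
+A minimal sketch of this kind of check, in userspace C rather than kernel
+code. The bitwise CRC32c routine is deliberately slow but portable. The helper
+name block_crc32c(), and the convention of treating the stored CRC field as
+zero while checksumming so that the CRC does not cover itself, are assumptions
+made for illustration.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Hypothetical helper: checksum a block as if the 4-byte CRC field at
 * crc_off contained zeroes, so the stored CRC does not feed into itself.
 */
static uint32_t block_crc32c(const uint8_t *blk, size_t len, size_t crc_off)
{
	static const uint8_t zero[4];
	uint32_t crc = 0xFFFFFFFFu;	/* conventional all-ones seed */

	assert(crc_off + 4 <= len);
	crc = crc32c_update(crc, blk, crc_off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, blk + crc_off + 4, len - crc_off - 4);
	return ~crc;			/* conventional final inversion */
}
```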
+
+Self describing metadata needs to contain enough information so that the
+metadata block can be verified as being in the correct place without needing
+to look at any other metadata. This means it needs to contain location
+information. Just adding a block number to the metadata is not sufficient to
+protect against misdirected writes - a write might be misdirected to the
+wrong LUN and so be written to the "correct block" of the wrong filesystem.
+Hence location information must contain a filesystem identifier as well as a
+block number.
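+
+As an illustration, the two-part location check can be sketched as follows.
+The loc_stamp structure and loc_stamp_matches() are hypothetical names, and
+the sketch ignores on-disk byte order.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified location stamp carried inside a metadata block. */
struct loc_stamp {
	uint8_t  fs_uuid[16];	/* which filesystem the block was written for */
	uint64_t blkno;		/* where in that filesystem it should live */
};

/*
 * Return true only if the block claims to belong to this filesystem and
 * to the address it was actually read from.
 */
static bool loc_stamp_matches(const struct loc_stamp *st,
			      const uint8_t fs_uuid[16], uint64_t read_blkno)
{
	assert(st && fs_uuid);

	/* Right block number but wrong filesystem, e.g. a write to the wrong LUN. */
	if (memcmp(st->fs_uuid, fs_uuid, 16) != 0)
		return false;

	/* Right filesystem but wrong address within it. */
	return st->blkno == read_blkno;
}
```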
+
+Another key information point in forensic analysis is knowing who the metadata
+block belongs to. We already know the type, the location, that it is valid
+and/or corrupted, and how long ago it was last modified. Knowing the owner of
+the block is important as it allows us to find other related metadata to
+determine the scope of the corruption. For example, if we have an extent btree
+object, we don’t know what inode it belongs to and hence have to walk the
+entire filesystem to find the owner of the block. Worse, the corruption
+could mean that no owner can be found (i.e. it’s an orphan block), and so
+without an owner field in the metadata we have no idea of the scope of the
+corruption. If we have an owner field in the metadata object, we can
+immediately do top down validation to determine the scope of the problem.
+
+Different types of metadata have different owner identifiers. For example,
+directory, attribute and extent tree blocks are all owned by an inode, whilst
+freespace btree blocks are owned by an allocation group. Hence the size and
+contents of the owner field are determined by the type of metadata object we
+are looking at. The owner information can also identify misplaced writes (e.g.
+freespace btree block written to the wrong AG).
+
+Self describing metadata also needs to contain some indication of when it was
+written to the filesystem. One of the key information points when doing
+forensic analysis is how recently the block was modified. Correlation of a set
+of corrupted metadata blocks based on modification times is important as it
+can indicate whether the corruptions are related, whether there have been
+multiple corruption events that led to the eventual failure, and even whether
+there are corruptions present that the run-time verification is not detecting.
+
+For example, we can determine whether a metadata object is supposed to be free
+space or still allocated if it is still referenced by its owner by looking at
+when the free space btree block that contains the block was last written
+compared to when the metadata object itself was last written. If the free
+space block is more recent than the object and the object’s owner, then there
+is a very good chance that the block should have been removed from the owner.
+
+To provide this "written timestamp", each metadata block has the Log Sequence
+Number (LSN) of the most recent transaction that modified it written into it.
+This number will always increase over the life of the filesystem, and the
+only thing that resets it is running xfs\_repair on the filesystem. Further,
+by use of the LSN we can tell if the corrupted metadata all belonged to the
+same log checkpoint and hence have some idea of how much modification occurred
+between the first and last instance of corrupt metadata on disk and, further,
+how much modification occurred between the corruption being written and when
+it was detected.
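+
+A sketch of how LSNs can be compared during analysis. It assumes an LSN is a
+64 bit value with the log cycle in the upper 32 bits and the log block in the
+lower 32 bits, so that a plain integer comparison orders modifications in
+time. All function names are hypothetical.
+

```c
#include <stdint.h>

/* Sketch assumption: cycle in the high 32 bits, log block in the low 32. */
static uint64_t lsn_pack(uint32_t cycle, uint32_t block)
{
	return ((uint64_t)cycle << 32) | block;
}

/* Negative, zero or positive as a is older than, equal to or newer than b. */
static int lsn_cmp(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/*
 * Hypothetical forensic check from the free space example: the block is
 * suspicious if the free space btree block that lists it was written after
 * both the object and the object's owner.
 */
static int freed_after_last_use(uint64_t free_lsn, uint64_t obj_lsn,
				uint64_t owner_lsn)
{
	return lsn_cmp(free_lsn, obj_lsn) > 0 &&
	       lsn_cmp(free_lsn, owner_lsn) > 0;
}
```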
+
+Runtime Validation
+~~~~~~~~~~~~~~~~~~
+
+Validation of self-describing metadata takes place at runtime in two places:
+
+-  immediately after a successful read from disk
+
+-  immediately prior to write IO submission
+
+The verification is completely stateless - it is done independently of the
+modification process, and seeks only to check that the metadata is what it
+says it is and that the metadata fields are within bounds and internally
+consistent. As such, we cannot catch all types of corruption that can occur
+within a block as there may be certain limitations that operational state
+enforces on the metadata, or there may be corruption of interblock
+relationships (e.g. corrupted sibling pointer lists). Hence we still need
+stateful checking in the main code body, but in general most of the per-field
+validation is handled by the verifiers.
+
+For read verification, the caller needs to specify the expected type of
+metadata that it should see, and the IO completion process verifies that the
+metadata object matches what was expected. If the verification process fails,
+then it marks the object being read as EFSCORRUPTED. The caller needs to catch
+this error (same as for IO errors), and if it needs to take special action due
+to a verification error it can do so by catching the EFSCORRUPTED error value.
+If we need more discrimination of error type at higher levels, we can define
+new error numbers for different errors as necessary.
+
+The first step in read verification is checking the magic number and
+determining whether CRC validation is necessary. If it is, the CRC32c is
+calculated and compared against the value stored in the object itself. Once
+this is validated, further checks are made against the location information,
+followed by extensive object specific metadata validation. If any of these
+checks fail, then the buffer is considered corrupt and the EFSCORRUPTED error
+is set appropriately.
+
+Write verification is the opposite of read verification - first the object is
+extensively verified and, if it is OK, we then update the LSN from the last
+modification made to the object. After this, we calculate the CRC and insert
+it into the object. Once this is done the write IO is allowed to continue. If
+any error occurs during this process, the buffer is again marked with an
+EFSCORRUPTED error for the higher layers to catch.
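+
+The ordering of the two directions can be modelled in userspace as follows.
+This is a toy sketch under stated assumptions: toy_hdr, TOY_MAGIC and the
+toy_* functions are hypothetical, the header is kept in host byte order, and
+the CRC field is zeroed while the checksum is computed.
+

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAGIC 0x544F5931u	/* hypothetical magic number */

/* Hypothetical 24-byte header, host byte order, no padding. */
struct toy_hdr {
	uint32_t magic;
	uint32_t crc;
	uint64_t blkno;
	uint64_t lsn;
};

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Checksum a copy of the header with the CRC field cleared. */
static uint32_t toy_cksum(const struct toy_hdr *h)
{
	struct toy_hdr tmp = *h;

	tmp.crc = 0;
	return crc32c(&tmp, sizeof(tmp));
}

/* Read side: magic number first, then the CRC, then the location. */
static bool toy_read_verify(const struct toy_hdr *h, uint64_t blkno)
{
	if (h->magic != TOY_MAGIC)
		return false;
	if (h->crc != toy_cksum(h))
		return false;
	return h->blkno == blkno;
}

/* Write side: verify the contents, stamp the LSN, compute the CRC last. */
static bool toy_write_verify(struct toy_hdr *h, uint64_t blkno, uint64_t lsn)
{
	if (h->magic != TOY_MAGIC || h->blkno != blkno)
		return false;
	h->lsn = lsn;
	h->crc = toy_cksum(h);
	return true;
}
```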
+
+Structures
+~~~~~~~~~~
+
+A typical on-disk structure needs to contain the following information:
+
+.. code:: c
+
+    struct xfs_ondisk_hdr {
+            __be32  magic;      /* magic number */
+            __be32  crc;        /* CRC, not logged */
+            uuid_t  uuid;       /* filesystem identifier */
+            __be64  owner;      /* parent object */
+            __be64  blkno;      /* location on disk */
+            __be64  lsn;        /* last modification in log, not logged */
+    };
+
+Depending on the metadata, this information may be part of a header structure
+separate to the metadata contents, or may be distributed through an existing
+structure. The latter occurs with metadata that already contains some of this
+information, such as the superblock and AG headers.
+
+Other metadata may have different formats for the information, but the same
+level of information is generally provided. For example:
+
+-  short btree blocks have a 32 bit owner (ag number) and a 32 bit block
+   number for location. The two of these combined provide the same information
+   as @owner and @blkno in the above structure, but using 8 bytes less space on
+   disk.
+
+-  directory/attribute node blocks have a 16 bit magic number, and the header
+   that contains the magic number has other information in it as well. Hence
+   the additional metadata headers change the overall format of the metadata.
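+
+As a rough illustration of the space saving in the first case, the sketch
+below compares two identifier layouts. The struct and function names are
+hypothetical and are not the kernel's definitions, and plain host-endian
+integers stand in for the on-disk big-endian types:
+
+.. code:: c
+
+    #include <assert.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    /* Hypothetical: owner and location as in struct xfs_ondisk_hdr. */
+    struct foo_long_ident {
+            uint64_t    owner;      /* parent object */
+            uint64_t    blkno;      /* location on disk */
+    };
+
+    /* Hypothetical: the 32 bit pair described for short btree blocks. */
+    struct foo_short_ident {
+            uint32_t    owner;      /* AG number */
+            uint32_t    blkno;      /* block number */
+    };
+
+    /* Neither layout needs padding, so the sizes are exact. */
+    static_assert(sizeof(struct foo_long_ident) == 16, "long form");
+    static_assert(sizeof(struct foo_short_ident) == 8, "short form");
+
+    /* Bytes saved per block by using the short form. */
+    static inline size_t
+    foo_ident_saving(void)
+    {
+            return sizeof(struct foo_long_ident) -
+                   sizeof(struct foo_short_ident);
+    }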
+
+A typical buffer read verifier is structured as follows:
+
+.. code:: c
+
+    #define XFS_FOO_CRC_OFF     offsetof(struct xfs_ondisk_hdr, crc)
+
+    static void
+    xfs_foo_read_verify(
+        struct xfs_buf  *bp)
+    {
+            struct xfs_mount *mp = bp->b_target->bt_mount;
+
+            if ((xfs_sb_version_hascrc(&mp->m_sb) &&
+                 !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+                        XFS_FOO_CRC_OFF)) ||
+                !xfs_foo_verify(bp)) {
+                    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+                    xfs_buf_ioerror(bp, EFSCORRUPTED);
+            }
+    }
+
+The code ensures that the CRC is only checked if the filesystem has CRCs
+enabled by checking the superblock for the feature bit, and then if the CRC
+verifies OK (or is not needed) it verifies the actual contents of the block.
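+
+The checksum step can be approximated in userspace as follows. This is a
+minimal sketch, not the kernel's implementation: the foo\_ names are
+hypothetical, the CRC is stored in host byte order for simplicity, and the
+sketch assumes the CRC field itself is treated as zero while the checksum is
+computed, so that the stored value does not feed back into its own
+calculation:
+
+.. code:: c
+
+    #include <assert.h>
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    /* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
+    static uint32_t
+    foo_crc32c(uint32_t crc, const uint8_t *p, size_t len)
+    {
+            while (len--) {
+                    crc ^= *p++;
+                    for (int i = 0; i < 8; i++)
+                            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u
+                                             : crc >> 1;
+            }
+            return crc;
+    }
+
+    /* Checksum the buffer, substituting zeroes for the 4 byte CRC field. */
+    static uint32_t
+    foo_cksum(const uint8_t *buf, size_t len, size_t crc_off)
+    {
+            static const uint8_t zero[4];
+            uint32_t crc = 0xFFFFFFFFu;
+
+            assert(crc_off + sizeof(zero) <= len);
+            crc = foo_crc32c(crc, buf, crc_off);
+            crc = foo_crc32c(crc, zero, sizeof(zero));
+            crc = foo_crc32c(crc, buf + crc_off + sizeof(zero),
+                             len - crc_off - sizeof(zero));
+            return ~crc;
+    }
+
+    /* Write side: stamp the checksum into the buffer. */
+    static void
+    foo_update_cksum(uint8_t *buf, size_t len, size_t crc_off)
+    {
+            uint32_t crc = foo_cksum(buf, len, crc_off);
+
+            memcpy(buf + crc_off, &crc, sizeof(crc));
+    }
+
+    /* Read side: recompute and compare against the stored value. */
+    static bool
+    foo_verify_cksum(const uint8_t *buf, size_t len, size_t crc_off)
+    {
+            uint32_t stored;
+
+            memcpy(&stored, buf + crc_off, sizeof(stored));
+            return stored == foo_cksum(buf, len, crc_off);
+    }
+
+With the CRC field at byte offset 4, foo\_update\_cksum(buf, len, 4) stamps
+the checksum before a write, and foo\_verify\_cksum(buf, len, 4) returns false
+if a single bit elsewhere in the buffer is subsequently flipped.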
+
+The verifier function will take a couple of different forms, depending on
+whether the magic number can be used to determine the format of the block. In
+the case it can’t, the code is structured as follows:
+
+.. code:: c
+
+    static bool
+    xfs_foo_verify(
+        struct xfs_buf      *bp)
+    {
+            struct xfs_mount    *mp = bp->b_target->bt_mount;
+            struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+
+            if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+                    return false;
+
+            /* the self-describing fields only exist when CRCs are enabled */
+            if (xfs_sb_version_hascrc(&mp->m_sb)) {
+                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                            return false;
+                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                            return false;
+                    if (hdr->owner == 0)
+                            return false;
+            }
+
+            /* object specific verification checks here */
+
+            return true;
+    }
+
+If there are different magic numbers for the different formats, the verifier
+will look like:
+
+.. code:: c
+
+    static bool
+    xfs_foo_verify(
+        struct xfs_buf      *bp)
+    {
+            struct xfs_mount    *mp = bp->b_target->bt_mount;
+            struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+
+            if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
+                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                            return false;
+                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                            return false;
+                    if (hdr->owner == 0)
+                            return false;
+            } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+                    return false;
+
+            /* object specific verification checks here */
+
+            return true;
+    }
+
+Write verifiers are very similar to the read verifiers; they just do things in
+the opposite order. A typical write verifier:
+
+.. code:: c
+
+    static void
+    xfs_foo_write_verify(
+        struct xfs_buf  *bp)
+    {
+        struct xfs_mount    *mp = bp->b_target->bt_mount;
+        struct xfs_buf_log_item *bip = bp->b_fspriv;
+
+        if (!xfs_foo_verify(bp)) {
+            XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+            xfs_buf_ioerror(bp, EFSCORRUPTED);
+            return;
+        }
+
+        if (!xfs_sb_version_hascrc(&mp->m_sb))
+            return;
+
+        if (bip) {
+            struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+            hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
+        }
+        xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
+    }
+
+This will verify the internal structure of the metadata before we go any
+further, detecting corruptions that have occurred as the metadata has been
+modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
+update the LSN field (when it was last modified) and calculate the CRC on the
+metadata. Once this is done, we can issue the IO.
+
+Inodes and Dquots
+~~~~~~~~~~~~~~~~~
+
+Inodes and dquots are special snowflakes. They have per-object CRC and
+self-identifiers, but they are packed so that there are multiple objects per
+buffer. Hence we do not use per-buffer verifiers to do the work of per-object
+verification and CRC calculations. The per-buffer verifiers simply perform
+basic identification of the buffer - that they contain inodes or dquots, and
+that there are magic numbers in all the expected spots. All further CRC and
+verification checks are done when each inode is read from or written back to
+the buffer.
+
+The structure of the verifiers and the identifier checks is very similar to
+the buffer code described above. The only difference is where they are called.
+For example, inode read verification is done in xfs\_iread() when the inode is
+first read out of the buffer and the struct xfs\_inode is instantiated. The
+inode is already extensively verified during writeback in xfs\_iflush\_int(),
+so the only addition here is to add the LSN and CRC to the inode as it is
+copied back into the buffer.
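+
+A buffer-level check of the kind described above might look like the following
+sketch. It assumes fixed-size objects that each begin with a 16 bit big-endian
+magic number. FOO\_OBJ\_MAGIC and the function names are hypothetical and do
+not correspond to the real inode or dquot verifiers:
+
+.. code:: c
+
+    #include <assert.h>
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    #define FOO_OBJ_MAGIC   0x464fu     /* hypothetical magic, "FO" */
+
+    /* Read a big-endian 16 bit value from an unaligned position. */
+    static uint16_t
+    foo_get_be16(const uint8_t *p)
+    {
+            return (uint16_t)((p[0] << 8) | p[1]);
+    }
+
+    /*
+     * Buffer-level identification only: every object slot must start with
+     * the magic number. Per-object CRC and identifier checks are left to
+     * the code that reads each object out of the buffer.
+     */
+    static bool
+    foo_objs_buf_verify(const uint8_t *buf, size_t buf_len, size_t obj_size)
+    {
+            assert(buf != NULL);
+
+            if (obj_size < 2 || buf_len == 0 || buf_len % obj_size != 0)
+                    return false;
+
+            for (size_t off = 0; off < buf_len; off += obj_size) {
+                    if (foo_get_be16(buf + off) != FOO_OBJ_MAGIC)
+                            return false;
+            }
+            return true;
+    }
+
+Keeping this check cheap means a buffer holding many objects is identified
+once, while the cost of full verification is only paid for the objects that
+are actually read or written.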



