Most illuminating, thank you :)

Allow me to add a few more nitpicks.

Hans-Peter Jansen wrote:
> Hi Dave,
>
> On Friday, 5 April 2013 18:00:06 Dave Chinner wrote:
>> xfs: add metadata CRC documentation
>>
>> From: Dave Chinner <dchinner@xxxxxxxxxx>
>>
>> Add some documentation about the self describing metadata and the
>> code templates used to implement it.
>
> Nice text. This is the coolest addition to XFS since the invention of
> sliced bread.
>
> One question arose from reading: since only the metadata is protected,
> any corruption of data blocks (file content) will still go unnoticed,
> won't it?
>
> Allow me to propose some minor corrections (from the nitpick
> department...).
>
>> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>> ---
>>  .../filesystems/xfs-self-describing-metadata.txt | 352 ++++++++++++++++++++
>>  1 file changed, 352 insertions(+)
>>
>> diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt
>> new file mode 100644
>> index 0000000..da7edc9
>> --- /dev/null
>> +++ b/Documentation/filesystems/xfs-self-describing-metadata.txt
>> @@ -0,0 +1,352 @@
>> +XFS Self Describing Metadata
>> +----------------------------
>> +
>> +Introduction
>> +------------
>> +
>> +The largest scalability problem facing XFS is not one of algorithmic
>> +scalability, but of verification of the filesystem structure. Scalability of
>> +the structures and indexes on disk and the algorithms for iterating them are
>> +adequate for supporting PB scale filesystems with billions of inodes; however,
>> +it is this very scalability that causes the verification problem.
>> +
>> +Almost all metadata on XFS is dynamically allocated. The only fixed location
>> +metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
>> +other metadata structures need to be discovered by walking the filesystem
>> +structure in different ways. While this is already done by userspace tools for
>> +validating and repairing the structure, there are limits to what they can
>> +verify, and this in turn limits the supportable size of an XFS filesystem.
>> +
>> +For example, it is entirely possible to manually use xfs_db and a bit of
>> +scripting to analyse the structure of a 100TB filesystem when trying to
>> +determine the root cause of a corruption problem, but it is still mainly a
>> +manual task of verifying that things like single bit errors or misplaced
>> +writes weren't the ultimate cause of a corruption event. It may take a few
>> +hours to a few days to perform such forensic analysis, so at this scale root
>> +cause analysis is entirely possible.
>> +
>> +However, if we scale the filesystem up to 1PB, we now have 10x as much
>> +metadata to analyse, and so that analysis blows out towards weeks/months of
>> +forensic work. Most of the analysis work is slow and tedious, so as the
>> +amount of analysis goes up, the more likely it is that the cause will be lost
>> +in the noise. Hence the primary concern for supporting PB scale filesystems
>> +is minimising the time and effort required for basic forensic analysis of the
>> +filesystem structure.
>> +
>> +
>> +Self Describing Metadata
>> +------------------------
>> +
>> +One of the problems with the current metadata format is that apart from the
>> +magic number in the metadata block, we have no other way of identifying what
>> +it is supposed to be. We can't even identify if it is in the right place. Put
>> +simply, you can't look at a single metadata block in isolation and say "yes,
>> +it is supposed to be there and the contents are valid".
>> +
>> +Hence most of the time spent on forensic analysis is spent doing basic
>> +verification of metadata values, looking for values that are in range (and
>> +hence not detected by automated verification checks) but are not correct.
>> +Finding and understanding how things like cross linked block lists (e.g.
>> +sibling pointers in a btree that end up with loops in them) came about is the
>> +key to understanding what went wrong, but it is impossible to tell what order
>> +the blocks were linked into each other or written to disk after the fact.
>> +
>> +Hence we need to record more information into the metadata to allow us to
>> +quickly determine if the metadata is intact and can be ignored for the
>> +purpose of analysis. We can't protect against every possible type of error,
>> +but we can ensure that common types of errors are easily detectable. Hence
>> +the concept of self describing metadata.
>> +
>> +The first, fundamental requirement of self describing metadata is that the
>> +metadata object contains some form of unique identifier in a well known
>> +location. This allows us to identify the expected contents of the block and
>> +hence parse and verify the metadata object. If we can't independently
>> +identify the type of metadata in the object, then the metadata doesn't
>> +describe itself very well at all!
>> +
>> +Luckily, almost all XFS metadata has magic numbers embedded already - only
>> +the AGFL, remote symlinks and remote attribute blocks do not contain
>> +identifying magic numbers. Hence we can change the on-disk format of all
>> +these objects to add more identifying information and detect this simply by
>> +changing the magic numbers in the metadata objects. That is, if it has the
>> +current magic number, the metadata isn't self identifying. If it contains a
>> +new magic number, it is self identifying and we can do much more expansive
>> +automated verification of the metadata object at runtime, during forensic
>> +analysis or repair.
>> +
>> +As a primary concern, self describing metadata needs to some form of overall
>
> ^^ scratch that
>
>> +integrity checking. We cannot trust the metadata if we cannot verify that it
>> +has not been changed as a result of external influences. Hence we need some
>> +form of integrity check, and this is done by adding CRC32c validation to the
>> +metadata block. If we can verify the block contains the metadata it was
>> +intended to contain, a large amount of the manual verification work can be
>> +skipped.
>> +
>> +CRC32c was selected as metadata cannot be more than 64k in length in XFS and
>> +hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
>> +metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it
>> +is fast. So while CRC32c is not the strongest of integrity checks that could be
>
> ^ possible (perhaps)
>
>> +used, it is more than sufficient for our needs and has relatively little
>> +overhead. Adding support for larger integrity fields and/or algorithms does
>
> n't
>
>> +really provide any extra value over CRC32c, but it does add a lot of
>> +complexity, and so there is no provision for changing the integrity checking
>> +mechanism.
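While I'm at it: the document never shows how the CRC can cover a block
without also covering itself. Perhaps a tiny sketch would help. This only
illustrates the usual trick - checksum up to the field, feed in zeroes in
its place, then checksum the rest - using the kernel's crc32c() helper;
the function name and the ~0U seed below are mine, not necessarily what
XFS uses:

#include <linux/crc32c.h>

/*
 * Checksum a metadata block while excluding the CRC field itself:
 * CRC everything before the field, substitute zeroes for the field,
 * then CRC the remainder of the block. Illustrative sketch only.
 */
static u32
calc_metadata_crc(
        const char      *buffer,
        size_t          length,
        size_t          crc_offset)
{
        u32             zero = 0;
        u32             crc;

        /* checksum everything up to the CRC field */
        crc = crc32c(~0U, buffer, crc_offset);

        /* treat the on-disk CRC field itself as zero */
        crc = crc32c(crc, &zero, sizeof(zero));

        /* checksum the rest of the block */
        return crc32c(crc, buffer + crc_offset + sizeof(zero),
                      length - crc_offset - sizeof(zero));
}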
>> +
>> +Self describing metadata needs to contain enough information so that the
>> +metadata block can be verified as being in the correct place without needing
>> +to look at any other metadata. This means it needs to contain location
>> +information. Just adding a block number to the metadata is not sufficient to
>> +protect against mis-directed writes - a write might be misdirected to the
>> +wrong LUN and so be written to the "correct block" of the wrong filesystem.
>> +Hence location information must contain a filesystem identifier as well as a
>> +block number.
>> +
>> +Another key information point in forensic analysis is knowing who the
>> +metadata block belongs to. We already know it's type, it's location, that it's valid
>
> shouldn't this be spelled: its ... its?
>
>> +and/or corrupted, and how long ago it was last modified. Knowing the owner
>> +of the block is important as it allows us to find other related metadata to
>> +determine the scope of the corruption. For example, if we have an extent
>> +btree object, we don't know what inode it belongs to and hence have to walk
>> +the entire filesystem to find the owner of the block. Worse, the corruption
>> +could mean that no owner can be found (i.e. it's an orphan block), and so
>> +without an owner field in the metadata we have no idea of the scope of the
>> +corruption. If we have an owner field in the metadata object, we can
>> +immediately do top down validation to determine the scope of the problem.
>> +
>> +Different types of metadata have different owner identifiers. For example,
>> +directory, attribute and extent tree blocks are all owned by an inode, whilst
>> +freespace btree blocks are owned by an allocation group. Hence the size and
>> +contents of the owner field are determined by the type of metadata object we
>> +are looking at. For example, directories, extent maps and attributes are
>> +owned by inodes, while freespace btree blocks are owned by a specific
>> +allocation group. THe owner information can also identify misplaced writes
>
> The
>
>> +(e.g. freespace btree block written to the wrong AG).
>> +
>> +Self describing metadata also needs to contain some indication of when it
>> +was written to the filesystem. One of the key information points when doing
>> +forensic analysis is how recently the block was modified. Correlation of a
>> +set of corrupted metadata blocks based on modification times is important as
>> +it can indicate whether the corruptions are related, whether there have been
>> +multiple corruption events that led to the eventual failure, and even whether
>> +there are corruptions present that the run-time verification is not
>> +detecting.
>> +
>> +For example, we can determine whether a metadata object is supposed to be free
>> +space or still allocated when it is still referenced by it's owner can be
>
> its allocated. When
>
>> +determined by looking at when the free space btree block that contains the
>> +block was last written compared to when the metadata object itself was last
>> +written.
>> +If the free space block is more recent than the object and the objects owner,

object's

>> +then there is a very good chance that the block should have been removed from
>> +it's owner.

its
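The comparison in that last paragraph might be easier to see as code.
Purely illustrative - none of these names are real XFS functions, and it
treats the LSN as the plain, monotonically increasing 64 bit value the
text describes:

/*
 * Forensic rule of thumb from the text above: if the free space btree
 * block covering an object was written more recently than both the
 * object and the object's owner, the object was probably freed but
 * never removed from its owner. All names here are hypothetical.
 */
static bool
object_reference_stale(
        __be64          object_lsn,
        __be64          owner_lsn,
        __be64          freesp_lsn)
{
        u64             obj = be64_to_cpu(object_lsn);
        u64             own = be64_to_cpu(owner_lsn);
        u64             fsp = be64_to_cpu(freesp_lsn);

        /* free space record newer than both => block likely freed */
        return fsp > obj && fsp > own;
}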
>> +
>> +To provide this "written timestamp", each metadata block gets the Log
>> +Sequence Number (LSN) of the most recent transaction it was modified on
>> +written into it. This number will always increase over the life of the
>> +filesystem, and the only thing that resets it is running xfs_repair on the
>> +filesystem. Further, by use of the LSN we can tell if the corrupted metadata
>> +all belonged to the same log checkpoint and hence have some idea of how much
>> +modification occurred between the first and last instance of corrupt metadata
>> +on disk and, further, how much modification occurred between the corruption
>> +being written and when it was detected.
>> +
>> +Runtime Validation
>> +------------------
>> +
>> +Validation of self-describing metadata takes place at runtime in two places:
>> +
>> + - immediately after a successful read from disk
>> + - immediately prior to write IO submission
>> +
>> +The verification is completely stateless - it is done independently of the
>> +modification process, and seeks only to check that the metadata is what it
>> +says it is and that the metadata fields are within bounds and internally
>> +consistent. As such, we cannot catch all types of corruption that can occur
>> +within a block as there may be certain limitations that operational state
>> +enforces on the metadata, or there may be corruption of interblock
>> +relationships (e.g. corrupted sibling pointer lists). Hence we still need
>> +stateful checking in the main code body, but in general most of the
>> +per-field validation is handled by the verifiers.
>> +
>> +For read verification, the caller needs to specify the expected type of
>> +metadata that it should see, and the IO completion process verifies that the
>> +metadata object matches what was expected. If the verification process fails,
>> +then it marks the object being read as EFSCORRUPTED. The caller needs to
>> +catch this error (same as for IO errors), and if it needs to take special
>> +action due to a verification error it can do so by catching the EFSCORRUPTED
>> +error value. If we need more discrimination of error type at higher levels,
>> +we can define new error numbers for different errors as necessary.
>> +
>> +The first step in read verification is checking the magic number and
>> +determining whether CRC validating is necessary. If it is, the CRC32c is caluclated and
>
> cu
>
>> +compared against the value stored in the object itself. Once this is
>> +validated, further checks are made against the location information,
>> +followed by extensive object specific metadata validation. If any of these
>> +checks fail, then the buffer is considered corrupt and the EFSCORRUPTED
>> +error is set appropriately.
>> +
>> +Write verification is the opposite of the read verification - first the
>> +object is extensively verified and if it is OK we then update the LSN from
>> +the last modification made to the object. After this, we calculate the CRC
>> +and insert it into the object. Once this is done the write IO is allowed to
>> +continue. If any error occurs during this process, the buffer is again
>> +marked with an EFSCORRUPTED error for the higher layers to catch.
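It might also be worth showing what "the caller needs to catch this
error" looks like in practice. A hypothetical caller, sketched against
the xfs_trans_read_buf() interface that takes a verifier ops structure;
xfs_foo_buf_ops would pair the two xfs_foo verifiers from the templates
below, and mp/tp/blkno/numblks are assumed context:

        struct xfs_buf  *bp;
        int             error;

        /* the attached ops supply the read verifier for this block */
        error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, blkno,
                                   numblks, 0, &bp, &xfs_foo_buf_ops);
        if (error) {
                if (error == EFSCORRUPTED) {
                        /* the verifier rejected the block contents */
                }
                return error;
        }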
>> +
>> +Structures
>> +----------
>> +
>> +A typical on-disk structure needs to contain the following information:
>> +
>> +struct xfs_ondisk_hdr {
>> +        __be32  magic;          /* magic number */
>> +        __be32  crc;            /* CRC, not logged */
>> +        uuid_t  uuid;           /* filesystem identifier */
>> +        __be64  owner;          /* parent object */
>> +        __be64  blkno;          /* location on disk */
>> +        __be64  lsn;            /* last modification in log, not logged */
>> +};
>> +
>> +Depending on the metadata, this information may be part of a header stucture
>
> structure
>
>> +separate to the metadata contents, or may be distributed through an existing
>> +structure. The latter occurs with metadata that already contains some of this
>> +information, such as the superblock and AG headers.
>> +
>> +Other metadata may have different formats for the information, but the same
>> +level of information is generally provided. For example:
>> +
>> + - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
>> +   number for location. The two of these combined provide the same
>> +   information as @owner and @blkno in the above structure, but using 8
>> +   bytes less space on disk.
>> +
>> + - directory/attribute node blocks have a 16 bit magic number, and the
>> +   header that contains the magic number has other information in it as
>> +   well. Hence the additional metadata headers change the overall format
>> +   of the metadata.
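To make the first list item above concrete, such a compact header might
look like this - a hypothetical layout for illustration, not an actual
on-disk definition:

/*
 * Compact self-describing header for short btree blocks: a 32 bit AG
 * number stands in for the 64 bit owner, and a 32 bit AG-relative
 * block number for the 64 bit disk address, saving 8 bytes per block.
 * Hypothetical layout, for illustration only.
 */
struct xfs_ondisk_short_hdr {
        __be32  magic;          /* magic number */
        __be32  crc;            /* CRC, not logged */
        uuid_t  uuid;           /* filesystem identifier */
        __be32  owner;          /* AG number of owning allocation group */
        __be32  blkno;          /* AG-relative block number */
        __be64  lsn;            /* last modification in log, not logged */
};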
>> +
>> +A typical buffer read verifier is structured as follows:
>> +
>> +#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
>> +
>> +static void
>> +xfs_foo_read_verify(
>> +        struct xfs_buf  *bp)
>> +{
>> +        struct xfs_mount *mp = bp->b_target->bt_mount;
>> +
>> +        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
>> +             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
>> +                               XFS_FOO_CRC_OFF)) ||
>> +            !xfs_foo_verify(bp)) {
>> +                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
>> +                xfs_buf_ioerror(bp, EFSCORRUPTED);
>> +        }
>> +}
>> +
>> +The code ensures that the CRC is only checked if the filesystem has CRCs
>> +enabled by checking the superblock for the feature bit, and then if the CRC
>> +verifies OK (or is not needed) it then verifies the actual contents of the
>> +block.
>
> ^^^^ scratch then perhaps
>
>> +
>> +The verifier function will take a couple of different forms, depending on
>> +whether the magic number can be used to determine the format of the block.
>> +In the case it can't, the code will is structured as follows:

scratch will ^^^^

>> +
>> +static bool
>> +xfs_foo_verify(
>> +        struct xfs_buf  *bp)
>> +{
>> +        struct xfs_mount        *mp = bp->b_target->bt_mount;
>> +        struct xfs_ondisk_hdr   *hdr = bp->b_addr;
>> +
>> +        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
>> +                return false;
>> +
>> +        if (xfs_sb_version_hascrc(&mp->m_sb)) {
>> +                if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
>> +                        return false;
>> +                if (bp->b_bn != be64_to_cpu(hdr->blkno))
>> +                        return false;
>> +                if (hdr->owner == 0)
>> +                        return false;
>> +        }
>> +
>> +        /* object specific verification checks here */
>> +
>> +        return true;
>> +}
>> +
>> +If there are different magic numbers for the different formats, the verifier
>> +will look like:
>> +
>> +static bool
>> +xfs_foo_verify(
>> +        struct xfs_buf  *bp)
>> +{
>> +        struct xfs_mount        *mp = bp->b_target->bt_mount;
>> +        struct xfs_ondisk_hdr   *hdr = bp->b_addr;
>> +
>> +        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
>> +                if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
>> +                        return false;
>> +                if (bp->b_bn != be64_to_cpu(hdr->blkno))
>> +                        return false;
>> +                if (hdr->owner == 0)
>> +                        return false;
>> +        } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
>> +                return false;
>> +
>> +        /* object specific verification checks here */
>> +
>> +        return true;
>> +}
>> +
>> +Write verifiers are very similar to the read verifiers; they just do things
>> +in the opposite order. A typical write verifier:
>> +
>> +static void
>> +xfs_foo_write_verify(
>> +        struct xfs_buf  *bp)
>> +{
>> +        struct xfs_mount        *mp = bp->b_target->bt_mount;
>> +        struct xfs_buf_log_item *bip = bp->b_fspriv;
>> +
>> +        if (!xfs_foo_verify(bp)) {
>> +                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
>> +                xfs_buf_ioerror(bp, EFSCORRUPTED);
>> +                return;
>> +        }
>> +
>> +        if (!xfs_sb_version_hascrc(&mp->m_sb))
>> +                return;
>> +
>> +        if (bip) {
>> +                struct xfs_ondisk_hdr   *hdr = bp->b_addr;
>> +
>> +                hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
>> +        }
>> +        xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
>> +}
>> +
>> +This will verify the internal structure of the metadata before we go any
>> +further, detecting corruptions that have occurred as the metadata has been
>> +modified in memory. If the metadata verifies OK, and CRCs are enabled, we
>> +then update the LSN field (when it was last modified) and calculate the CRC
>> +on the metadata. Once this is done, we can issue the IO.
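One thing the templates don't show is how the verifier pair gets attached
to a buffer. That is done with an xfs_buf_ops structure; continuing the
made-up xfs_foo example (only struct xfs_buf_ops itself is real):

/*
 * Pair the verifiers on a buffer: verify_read runs at IO completion,
 * verify_write immediately before write IO submission. The buffer
 * cache invokes these; callers never call the verifiers directly.
 */
const struct xfs_buf_ops xfs_foo_buf_ops = {
        .verify_read    = xfs_foo_read_verify,
        .verify_write   = xfs_foo_write_verify,
};

This is also the ops structure that the hypothetical read example earlier
would pass to xfs_trans_read_buf().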
>> +
>> +Inodes and Dquots
>> +-----------------
>> +
>> +Inodes and dquots are special snowflakes. They have per-object CRC and
>> +self-identifiers, but they are packed so that there are multiple objects per
>> +buffer. Hence we do not use per-buffer verifiers to do the work of per-object
>> +verification and CRC calculations. The per-buffer verifiers simply perform
>> +basic identification of the buffer - that they contain inodes or dquots, and
>> +that there are magic numbers in all the expected spots. All further CRC and
>> +verification checks are done when each inode is read from or written back to
>> +the buffer.
>> +
>> +The structure of the verifiers and the identifier checks is very similar to
>> +the buffer code described above. The only difference is where they are
>> +called. For example, inode read verification is done in xfs_iread() when the
>> +inode is first read out of the buffer and the struct xfs_inode is
>> +instantiated. The inode is already extensively verified during writeback in
>> +xfs_iflush_int, so the only addition here add the LSN and CRC to the inode as
>
> ^ is to
>
>> +it is copied back into the buffer.
>> +
>> +XXX: inode unlinked list modification doesn't recalculate the inode CRC! None
>> +of the unlinked list modifications check or update CRCs, neither during
>> +unlink nor log recovery. So, it's gone unnoticed until now. This won't matter
>> +immediately - repair will probably complain about it - but it needs to be
>> +fixed.
>> +
>
> Cheers,
> Pete

Cheers,

Dave