On Jun 07, 2006 11:10 +0100, Stephen C. Tweedie wrote:
> On Mon, 2006-05-29 at 10:09 -0600, Andreas Dilger wrote:
> > This is one thing that we have been thinking of for ext3.  Instead of a
> > filesystem-wide "error" bit we could move this per-group to only mark
> > the block or inode bitmaps in error if they have a checksum failure.
> > This would prevent allocations from that group to avoid further potential
> > corruption of the filesystem metadata.
>
> Trouble is, individual files can span multiple groups easily.  And one
> of the common failure modes is failure in the indirect tree.  What
> action do you take if you detect that?

Return an IO error for that part of the file?  We already refuse to free
file blocks that overlap with filesystem metadata, but have no way to know
whether the rest of the blocks are valid or not.

> There is fundamentally a large difference between the class of errors
> that can arise due to EIO --- simple loss of a block of data --- and
> those which can arise from actual corrupt data/metadata.  If we detect
> the latter and attempt to soldier on regardless, then we have no idea
> what inconsistencies we are allowing to be propagated through the
> filesystem.

Recall that one of the other goals is to add checksumming to the extent
tree metadata (if it isn't already covered by the inode checksum).
Even today, the fact that the extent format has a magic allows some types
of corruption to be detected.  The structure is also somewhat verifiable
(e.g. logical extent offsets are increasing, logical_offset + length is
non-overlapping with next logical offset, etc) even without checksums.

The proposed ext3_extent_tail would also contain an inode+generation
back-reference and the checksum would depend on the physical block location
so if one extent index block were incorrectly written in the place of
another, or the higher-level reference were corrupted this would also be
detectable.
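As a rough illustration of the structural checks mentioned above (a magic
value, increasing logical offsets, no overlap with the next extent), a
minimal C sketch might look like the following.  The names
sketch_extent_header, sketch_extent and sketch_extents_plausible, the
field layout and the magic value are all hypothetical and are not the
actual on-disk extent format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder magic for this sketch only (assumption, not a format spec). */
#define SKETCH_EXTENT_MAGIC 0xF30AU

/* Hypothetical, simplified header preceding an array of extents. */
struct sketch_extent_header {
	uint16_t magic;    /* identifies the block as extent metadata */
	uint16_t entries;  /* number of valid extents that follow */
};

/* Hypothetical, simplified extent: a run of logical blocks. */
struct sketch_extent {
	uint32_t logical;  /* first logical block covered */
	uint32_t length;   /* number of blocks covered */
};

/*
 * Return true if the header magic matches and the extents are sorted by
 * logical offset with no overlap between neighbouring entries.
 */
bool sketch_extents_plausible(const struct sketch_extent_header *hdr,
			      const struct sketch_extent *ext)
{
	assert(hdr != NULL && ext != NULL);

	if (hdr->magic != SKETCH_EXTENT_MAGIC)
		return false;

	for (size_t i = 0; i < hdr->entries; i++) {
		/* 64-bit arithmetic so logical + length cannot wrap around */
		uint64_t end = (uint64_t)ext[i].logical + ext[i].length;

		/* an empty extent is never valid in this sketch */
		if (ext[i].length == 0)
			return false;

		/* logical_offset + length must not run into the next extent */
		if (i + 1 < hdr->entries && end > ext[i + 1].logical)
			return false;
	}
	return true;
}
```

A check like this only catches implausible structure; it says nothing
about whether the physical block pointers themselves are correct, which
is where a checksum and back-reference would come in.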
    struct ext3_extent_tail {
            __u64 et_inum;
            __u32 et_igeneration;
            __u32 et_checksum;
    };

> That can easily end up corrupting files far from the actual error.  Say
> an indirect block is corrupted; we delete that file, and end up freeing
> a block belonging to some other file on a distant block group.  Ooops.
> Once that other block gets reallocated and overwritten, we have
> corrupted that other file.

Oh, I totally agree with that, which is another reason why I've proposed
the "block mapped extent" several times.  It would be referenced from an
extent index block or inode, would start with an extent header to verify
that these are at least semi-plausible block pointers, and can optionally
have an ext3_extent_tail to validate the block data itself.  The
block-mapped extent is useful for fragmented files or files with lots
of small holes in them.  Conceivably it would also be possible to quickly
remap old block-mapped (indirect tree) files to bm-extent files if this
was desirable.

> *That* is why taking the fs down/readonly on failure is the safe option.

And wait 17 years for e2fsck to complete?  While I agree it is the safest
option, sometimes it is necessary to just block off parts of the filesystem
from writes and soldier on until the system can be taken down safely.

> The inclusion of checksums would certainly allow us to harden things.
> In the above scenario, failure of the checksum test would allow us to
> discard corrupt indirect blocks before we could allow any harm to come
> to other disk blocks.  But that only works for cases where the checksum
> notices the problem; if we're talking about possible OS bugs, memory
> corruption etc. then it is quite possible to get corruption in the
> in-memory copy, which gets properly checksummed and written to disk, so
> you can't rely on that catching all cases.

I agree, we can't ever handle everything unless we get checksums from the
top of linux to the bottom (maybe stored in the page table?), but we can
at least do the best we can.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.