Re: topics for the file system mini-summit

Andreas Dilger <adilger@xxxxxxxxxxxxx> · Wed, 7 Jun 2006 12:55:23 -0600

On Jun 07, 2006  11:10 +0100, Stephen C. Tweedie wrote:
> On Mon, 2006-05-29 at 10:09 -0600, Andreas Dilger wrote:
> > This is one thing that we have been thinking of for ext3.  Instead of a
> > filesystem-wide "error" bit we could move this per-group to only mark
> > the block or inode bitmaps in error if they have a checksum failure.
> > This would prevent allocations from that group to avoid further potential
> > corruption of the filesystem metadata.
> 
> Trouble is, individual files can span multiple groups easily.  And one
> of the common failure modes is failure in the indirect tree.  What
> action do you take if you detect that?

Return an IO error for that part of the file?  We already refuse to
free file blocks that overlap with filesystem metadata, but have no
way to know whether the rest of the blocks are valid or not.

> There is fundamentally a large difference between the class of errors
> that can arise due to EIO --- simple loss of a block of data --- and
> those which can arise from actual corrupt data/metadata.  If we detect
> the latter and attempt to soldier on regardless, then we have no idea
> what inconsistencies we are allowing to be propagated through the
> filesystem.  

Recall that one of the other goals is to add checksumming to the extent
tree metadata (if it isn't already covered by the inode checksum).  Even
today, the fact that the extent format has a magic allows some types of
corruption to be detected.  The structure is also somewhat verifiable 
(e.g. logical extent offsets are increasing, logical_offset + length is
non-overlapping with next logical offset, etc) even without checksums.
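
As a rough illustration of the structural checks mentioned above, the
sketch below walks an array of extents and rejects it if the logical
offsets are not increasing or if one extent runs into the next.  The
struct sketch_extent type and the sketch_extents_plausible() name are
hypothetical and are not the ext3 on-disk format; treating a zero-length
extent as invalid is also an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory extent, NOT the ext3 on-disk layout. */
struct sketch_extent {
	uint32_t logical;	/* first logical block covered */
	uint32_t len;		/* number of blocks covered */
	uint64_t physical;	/* first physical block */
};

/*
 * Return true if the extents are sorted by logical offset and no
 * extent overlaps the start of the next one.
 */
static bool sketch_extents_plausible(const struct sketch_extent *ex, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* Assumption: an empty extent is never valid. */
		if (ex[i].len == 0)
			return false;

		/* Use a 64-bit sum so logical + len cannot wrap. */
		uint64_t end = (uint64_t)ex[i].logical + ex[i].len;
		if (end > (uint64_t)UINT32_MAX + 1)
			return false;

		/* Catches both overlap and out-of-order offsets. */
		if (i + 1 < n && end > ex[i + 1].logical)
			return false;
	}
	return true;
}
```

A check like this catches only gross damage; it cannot tell whether a
well-formed extent points at the right physical blocks.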

The proposed ext3_extent_tail would also contain an inode+generation
back-reference, and the checksum would depend on the physical block
location, so if one extent index block were incorrectly written in the
place of another, or the higher-level reference were corrupted, this
would also be detectable.

	struct ext3_extent_tail {
		__u64   et_inum;	/* owning inode number */
		__u32   et_igeneration;	/* owning inode generation */
		__u32   et_checksum;	/* depends on physical block location */
	};
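
One way such a tail could be computed and verified is sketched below.
This is only an illustration under stated assumptions: the checksum
algorithm is not specified above, so a plain bitwise CRC-32 is swapped in,
the physical block number is simply fed into the CRC first, and all
sketch_* names are hypothetical.  Values are hashed in host byte order,
which a real on-disk format would have to pin down.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same fields as the proposed tail, using portable fixed-width types. */
struct sketch_extent_tail {
	uint64_t et_inum;
	uint32_t et_igeneration;
	uint32_t et_checksum;
};

/* Bitwise reflected CRC-32 (polynomial 0xEDB88320); an assumed choice. */
static uint32_t sketch_crc32(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Checksum over physical block number, block payload and back-reference. */
static uint32_t sketch_tail_csum(const void *payload, size_t payload_len,
				 uint64_t pblk,
				 const struct sketch_extent_tail *t)
{
	uint32_t crc = ~0u;

	crc = sketch_crc32(crc, &pblk, sizeof(pblk));	/* host byte order */
	crc = sketch_crc32(crc, payload, payload_len);
	crc = sketch_crc32(crc, &t->et_inum, sizeof(t->et_inum));
	crc = sketch_crc32(crc, &t->et_igeneration, sizeof(t->et_igeneration));
	return ~crc;
}

/* True only if the block belongs to this inode and sits at this location. */
static bool sketch_tail_verify(const void *payload, size_t payload_len,
			       uint64_t pblk, uint64_t inum, uint32_t igen,
			       const struct sketch_extent_tail *t)
{
	return t->et_inum == inum &&
	       t->et_igeneration == igen &&
	       t->et_checksum == sketch_tail_csum(payload, payload_len, pblk, t);
}
```

With this arrangement, a block written to the wrong location fails
verification even when its contents are internally consistent, because
the expected physical block number is part of the checksummed input.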

> That can easily end up corrupting files far from the actual error.  Say
> an indirect block is corrupted; we delete that file, and end up freeing
> a block belonging to some other file on a distant block group.  Ooops.
> Once that other block gets reallocated and overwritten, we have
> corrupted that other file.

Oh, I totally agree with that, which is another reason why I've proposed
the "block mapped extent" several times.  It would be referenced from
an extent index block or inode, would start with an extent header to
verify that these are at least semi-plausible block pointers, and could
optionally have an ext3_extent_tail to validate the block data itself.

The block-mapped extent is useful for fragmented files or files with
lots of small holes in them.  Conceivably it would also be possible
to quickly remap old block-mapped (indirect tree) files to bm-extent
files if this were desirable.

> *That* is why taking the fs down/readonly on failure is the safe option.

And wait 17 years for e2fsck to complete?  While I agree it is the
safest option, sometimes it is necessary to just block off parts of the
filesystem from writes and soldier on until the system can be taken down
safely.
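
A minimal sketch of what "blocking off" a group might look like in an
allocator, assuming a per-group error flag that is set when a bitmap
checksum fails.  The struct sketch_group type and the function names are
hypothetical and do not correspond to existing ext3 code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-group state, NOT the ext3 group descriptor. */
struct sketch_group {
	bool     bitmap_error;	/* set when a bitmap checksum fails */
	uint32_t free_blocks;	/* free blocks according to the bitmap */
};

/* Record that this group's bitmaps can no longer be trusted. */
static void sketch_mark_group_error(struct sketch_group *g)
{
	g->bitmap_error = true;
}

/*
 * Pick a group to allocate from, starting at the goal group and
 * skipping any group marked in error.  Returns -1 if none is usable.
 */
static int sketch_pick_group(const struct sketch_group *groups,
			     int ngroups, int goal)
{
	if (ngroups <= 0 || goal < 0)
		return -1;

	for (int i = 0; i < ngroups; i++) {
		int g = (goal + i) % ngroups;

		if (groups[g].bitmap_error)
			continue;	/* never allocate from a suspect group */
		if (groups[g].free_blocks > 0)
			return g;
	}
	return -1;
}
```

Reads from a flagged group can still be served; only new allocations are
steered elsewhere until the filesystem can be checked.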

> The inclusion of checksums would certainly allow us to harden things.
> In the above scenario, failure of the checksum test would allow us to
> discard corrupt indirect blocks before we could allow any harm to come
> to other disk blocks.  But that only works for cases where the checksum
> notices the problem; if we're talking about possible OS bugs, memory
> corruption etc. then it is quite possible to get corruption in the in-
> memory copy, which gets properly checksummed and written to disk, so you
> can't rely on that catching all cases.

I agree, we can't ever handle everything unless we get checksums from the
top of Linux to the bottom (maybe stored in the page table?), but we can
at least do the best we can.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
