Re: Fix(es) for ext2 fsync bug

Theodore Tso <tytso@xxxxxxx> · Thu, 15 Feb 2007 09:20:21 -0500

On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> Background: The eXplode file system checker found a bug in ext2 fsync
> behavior.  Do the following: truncate file A, create file B which
> reallocates one of A's old indirect blocks, fsync file B.  If you then
> crash before file A's metadata is all written out, fsck will complete
> the truncate for file A... thereby deleting file B's data.  So fsync
> file B doesn't guarantee data is on disk after a crash.  Details:

It's actually not the case that fsck will complete the truncate for
file A.  The problem is that while e2fsck is processing indirect
blocks in pass 1, the block which is marked as file A's indirect block
(but which actually contains file B's data) gets "fixed" when e2fsck
sees block numbers which look like illegal block numbers.  So this
ends up corrupting file B's data.
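
To make the failure mode concrete, here is a toy model in Python. It is
not e2fsck's actual code: the block size, the filesystem size, and the
function name fix_indirect_block are all assumptions made for
illustration. It interprets a block as an array of 32-bit block numbers
and zeroes every entry that falls outside the valid range, which is
roughly what the "fix" amounts to when the block really holds file data.

```python
import struct

BLOCK_SIZE = 1024        # assumed block size in bytes
FIRST_DATA_BLOCK = 1     # assumed; illustrative value only
BLOCKS_COUNT = 8192      # assumed filesystem size in blocks


def fix_indirect_block(raw):
    """Hypothetical helper: zero out-of-range block numbers in a block.

    Treats `raw` as little-endian 32-bit block numbers and returns
    (fixed_bytes, number_of_entries_cleared).
    """
    n = len(raw) // 4
    entries = list(struct.unpack("<%dI" % n, raw))
    cleared = 0
    for i, blk in enumerate(entries):
        # 0 means "no block here" and is left alone; anything else must
        # fall inside the filesystem to be considered legal.
        if blk != 0 and not (FIRST_DATA_BLOCK <= blk < BLOCKS_COUNT):
            entries[i] = 0
            cleared += 1
    return struct.pack("<%dI" % n, *entries), cleared


# File B's data, sitting in a block that file A's stale on-disk metadata
# still describes as an indirect block.
file_b_data = (b"hello, this is file B's data\n" * 40)[:BLOCK_SIZE]

fixed, cleared = fix_indirect_block(file_b_data)
print("entries cleared:", cleared)
print("data intact:", fixed == file_b_data)
```

In this model, text data read as block numbers is out of range in every
slot, so the whole block gets zeroed.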

This is actually a legal end result, BTW, since POSIX states that the
result of fsync() is undefined if the system crashes.  Technically
fsync() did actually guarantee that file B's data is "on disk"; the
problem is that e2fsck would corrupt the data afterwards.  Ironically,
fsync()'ing file B actually makes it more likely that it might get
corrupted afterwards, since normally filesystem metadata gets sync'ed
out on 5 second intervals, while data gets sync'ed out at 30 second
intervals.

> * Rearrange order of duplicate block checking and fixing file size in
>   fsck.  Not sure how hard this is. (Ted?)

It's not a matter of changing when we deal with fixing the file size,
as described above.  At fsck time, we would need to keep backup
copies of any indirect blocks that get modified for whatever reason,
and then in pass 1D, when we clone a block that has been claimed by
multiple inodes, the inodes which claim the block as a data block
should get a copy of the block before it was modified by e2fsck.
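
A minimal sketch of that idea, in Python rather than e2fsck's C, and
with every name in it (ToyDisk, fsck_write, clone_for) made up for
illustration: the checker saves a block's original contents the first
time it modifies it, and the cloning step hands the saved copy to
whichever inode claims the block as data.

```python
class ToyDisk:
    """Hypothetical in-memory stand-in for the device being checked."""

    def __init__(self):
        self.blocks = {}    # block number -> current contents
        self.backups = {}   # block number -> contents before the first fix

    def fsck_write(self, blk, data):
        # Keep only the first pre-modification copy, so repeated fixes
        # cannot overwrite the original contents.
        if blk not in self.backups:
            self.backups[blk] = self.blocks[blk]
        self.blocks[blk] = data

    def clone_for(self, blk, new_blk, claimed_as_data):
        # Cloning step for a block claimed by several inodes: an inode
        # using it as data gets the contents from before any fix, while
        # an inode using it as an indirect block gets the fixed version.
        if claimed_as_data and blk in self.backups:
            self.blocks[new_blk] = self.backups[blk]
        else:
            self.blocks[new_blk] = self.blocks[blk]
        return new_blk


disk = ToyDisk()
disk.blocks[100] = b"file B's data"
disk.fsck_write(100, bytes(13))                  # the "fix" zeroes it
disk.clone_for(100, 200, claimed_as_data=True)   # copy for file B
print(disk.blocks[200])
```

The cost is the memory or scratch space for the backups, which is
bounded by the number of indirect blocks the checker actually modifies.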

> * Keep a set of "still allocated on disk" block bitmaps that gets
>   flushed whenever a sync happens.  Don't allocate these blocks.
>   Journaling file systems already have to do this.

A list would be more efficient, as others have pointed out.  That
would work, although the hard part is knowing when entries could be
removed from the list.  The machinery for knowing when metadata has
been updated
isn't present in ext2, and that's a fair amount of complexity.  You
could clear the list/bitmap after the 5 second metadata flush command
has been kicked off, or if you associate each freed block with the
inode that previously owned it, you could clear the entry when that inode's
dirty bit has been cleared, but that doesn't completely get rid of the
race unless you tie it to when the write has completed (and this
assumes write barriers to make sure the block was actually flushed to
the media).
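
The bookkeeping side of this can be sketched as follows. This is a toy
model in Python, not ext2 code, and ToyAllocator and its method names
are hypothetical. Freed blocks sit on a pending list and only become
allocatable again once something reports that the metadata write has
completed. Deciding when that report can be trusted is the hard part
described above, and the sketch simply assumes it.

```python
class ToyAllocator:
    """Hypothetical allocator that defers reuse of freed blocks."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))
        # Freed in memory, but on-disk metadata may still claim them.
        self.pending = []

    def allocate(self):
        if not self.free:
            raise RuntimeError("no reusable blocks until metadata is flushed")
        blk = min(self.free)
        self.free.remove(blk)
        return blk

    def release(self, blk):
        # Do not return the block to the free set yet.
        self.pending.append(blk)

    def metadata_write_completed(self):
        # Assumed to be called only once the write has really reached
        # the media; the blocks are now safe to hand out again.
        self.free.update(self.pending)
        self.pending.clear()


alloc = ToyAllocator(2)
a = alloc.allocate()      # file A's block
alloc.release(a)          # truncate A; on-disk metadata still stale
b = alloc.allocate()      # file B gets a different block
print(a, b)
```

With the pending list in place, file B never lands in a block that the
on-disk copy of file A's metadata still points to.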

Another very heavyweight approach would be to simply force a full sync
of the filesystem whenever fsync() is called.  Not pretty, and without
the proper write ordering, the race is still potentially there.

I'd say that the best way to handle this is in fsck, but quite frankly
it's a relatively low priority "bug" to handle, since a much simpler
workaround is to tell people to use ext3 instead.

Regards,

						- Ted