On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> Background: The eXplode file system checker found a bug in ext2 fsync
> behavior. Do the following: truncate file A, create file B which
> reallocates one of A's old indirect blocks, fsync file B. If you then
> crash before file A's metadata is all written out, fsck will complete
> the truncate for file A... thereby deleting file B's data. So fsync
> file B doesn't guarantee data is on disk after a crash. Details:

It's actually not the case that fsck will complete the truncate for
file A. The problem is that while e2fsck is processing indirect
blocks in pass 1, the block which is marked as file A's indirect block
(but which actually contains file B's data) gets "fixed" when e2fsck
sees block numbers which look like illegal block numbers. So this
ends up corrupting file B's data.

This is actually a legal end result, BTW, since POSIX states that the
result of fsync() is undefined if the system crashes. Technically
fsync() did actually guarantee that file B's data is "on disk"; the
problem is that e2fsck would corrupt the data afterwards.

Ironically, fsync()'ing file B actually makes it more likely that it
might get corrupted afterwards, since normally filesystem metadata
gets sync'ed out on 5 second intervals, while data gets sync'ed out at
30 second intervals.

> * Rearrange order of duplicate block checking and fixing file size in
> fsck. Not sure how hard this is. (Ted?)

It's not a matter of changing when we deal with fixing the file size,
as described above. At fsck time, we would need to keep backup copies
of any indirect blocks that get modified for whatever reason, and then
in pass 1D, when we clone a block that has been claimed by multiple
inodes, the inodes which claim the block as a data block should get a
copy of the block as it was before e2fsck modified it.

> * Keep a set of "still allocated on disk" block bitmaps that gets
> flushed whenever a sync happens. Don't allocate these blocks.
> Journaling file systems already have to do this.

A list would be more efficient, as others have pointed out. That
would work, although the hard part is knowing when entries could be
removed from the list. The machinery for knowing when metadata has
been updated isn't present in ext2, and that's a fair amount of
complexity. You could clear the list/bitmap after the 5 second
metadata flush command has been kicked off, or if you associate a data
block with the inode that previously owned it, you could clear the
entry when that inode's dirty bit has been cleared, but that doesn't
completely get rid of the race unless you tie it to when the write has
completed (and this assumes write barriers to make sure the block was
actually flushed to the media).

Another very heavyweight approach would be to simply force a full sync
of the filesystem whenever fsync() is called. Not pretty, and without
the proper write ordering, the race is still potentially there.

I'd say that the best way to handle this is in fsck, but quite frankly
it's a relatively low priority "bug" to handle, since a much simpler
workaround is to tell people to use ext3 instead.

Regards,

						- Ted
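
The "still allocated on disk" list discussed above can be sketched as a
toy model in Python. This is not ext2 code: the class ToyAllocator and
its methods alloc_block, free_block, and metadata_flushed are
hypothetical names, and the sketch assumes metadata_flushed is only
called once the metadata write has really completed on the media.

```python
class ToyAllocator:
    """Toy block allocator with a deferred-free list (hypothetical, not ext2 code)."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))   # blocks safe to hand out
        self.pending = set()              # freed in memory, still referenced on disk

    def alloc_block(self):
        # Only hand out blocks that no on-disk metadata still points to.
        if not self.free:
            raise RuntimeError("no reusable blocks until metadata is flushed")
        block = min(self.free)
        self.free.remove(block)
        return block

    def free_block(self, block):
        # Truncate frees the block in memory, but the on-disk inode or
        # indirect block may still reference it, so park it for now.
        self.pending.add(block)

    def metadata_flushed(self):
        # Assumption: called only after the metadata write has completed.
        self.free |= self.pending
        self.pending.clear()


alloc = ToyAllocator(nblocks=2)
a_indirect = alloc.alloc_block()      # block used by file A
alloc.free_block(a_indirect)          # truncate file A
b_data = alloc.alloc_block()          # file B must get a different block
alloc.metadata_flushed()              # A's truncate is now on disk
reused = alloc.alloc_block()          # A's old block may now be reused
```

In this toy model, file B never receives file A's old block until the
flush has been reported, so a crash before the flush cannot leave A's
stale on-disk pointer aimed at B's data. The price is that freed
blocks stay unavailable until the next completed metadata flush.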