Re: data corrupting bug in 2.4.20 ext3, data=journal

"Stephen C. Tweedie" <sct@redhat.com> · 02 Dec 2002 15:37:22 +0000

On Mon, 2002-12-02 at 12:15, Stephen C. Tweedie wrote:

> The problem is that ext3 is expecting that truncate_inode_pages() (and
> hence ext3_flushpage) is only called during a truncate.  That's what the
> function is named for, after all, and it's the hint we need to indicate
> that future writeback on the data we're discarding should be disabled
> (so that we don't get old data written on top of new data should the
> block get deallocated.)
> 
> But kill_supers() eventually calls truncate_inode_pages() too when we're
> doing the invalidate_inodes().

But we've already called fsync_super() at this point.  If that completes
synchronously, then the buffers will already be out of the journal and
queued for writeback when we get here, and clearing BH_JBDDirty won't
have any dire consequences.

Indeed, loading the ext3 module with "do_sync_supers=1" cures the
symptoms.

However, doing every superblock write synchronously is a non-starter, as
that requires a journal commit in the ext3 universe, and so this would
essentially force bdflush to serialise all of its filesystem flushes
(which is a real problem once you've got a significant number of
filesystems mounted.)  But the VFS simply doesn't have any clean way of
telling foo_write_super() whether the call needs to be completed
synchronously or asynchronously.

There *is* a totally unclean way, though.  kill_super() sets sb->s_root
to NULL before calling its final fsync_super(), and ext3 can easily
check that in ext3_write_super().  It's a nasty, messy dependency, but
should work for 2.4 at least.  For 2.5 we probably want to look at
passing an async flag down into the write_super.

The attached patch seems to fix things for me.

Cheers,
 Stephen

--- linux-2.4-ext3merge/fs/ext3/super.c.=K0027=.orig	2002-12-02 15:35:13.000000000 +0000
+++ linux-2.4-ext3merge/fs/ext3/super.c	2002-12-02 15:35:14.000000000 +0000
@@ -1640,7 +1640,12 @@
 	sb->s_dirt = 0;
 	target = log_start_commit(EXT3_SB(sb)->s_journal, NULL);
 
-	if (do_sync_supers) {
+	/*
+	 * Tricky --- if we are unmounting, the write really does need
+	 * to be synchronous.  We can detect that by looking for NULL in
+	 * sb->s_root.
+	 */
+	if (do_sync_supers || !sb->s_root) {
 		unlock_super(sb);
 		log_wait_commit(EXT3_SB(sb)->s_journal, target);
 		lock_super(sb);