On 10/27/12 4:19 PM, Eric Sandeen wrote: > On 10/27/12 1:47 PM, Nix wrote: >> On 27 Oct 2012, Theodore Ts'o said: >> >>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote: >>>> Ah! it's turned on by journal_async_commit. OK, that alone argues >>>> against use of journal_async_commit, tested or not, and I'd not have >>>> turned it on if I'd noticed that. >>>> >>>> (So, the combinations I'll be trying for effect on this bug are: >>>> >>>> journal_async_commit (as now) >>>> journal_checksum >>>> none >>> >>> Can you also check and see whether the presence or absence of >>> "nobarrier" makes a difference? >> >> Done. (Also checked the effect of your patches posted earlier this week: >> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test >> I was carrying out, umount -l'ing /var as the very last thing I did >> before /sbin/reboot -f.) >> >> nobarrier makes a difference that I, at least, did not expect: >> >> [no options] No corruption >> >> nobarrier No corruption >> >> journal_checksum Corruption >> Corrupted transaction, journal aborted >> >> nobarrier,journal_checksum Corruption >> Corrupted transaction, journal aborted >> >> journal_async_commit Corruption >> Corrupted transaction, journal aborted >> >> nobarrier,journal_async_commit Corruption >> No corrupted transaction or aborted journal > > That's what we needed. Woulda been great a few days ago ;) > > In my testing journal_checksum is broken, and my bisection seems to > implicate > > commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 > Author: Theodore Ts'o <tytso@xxxxxxx> > Date: Mon Feb 6 20:12:03 2012 -0500 > > ext4: fold ext4_claim_inode into ext4_new_inode > > as the culprit. I haven't had time to look into why, yet. It looks like the inode_bitmap_bh is being modified outside a transaction: ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data); It needs a get_write_access / handle_dirty_metadata pair around it. A hacky patch like this seems to work but it was done 5mins before I have to be out the door to dinner so it probably needs cleanup or at least scrutiny. [PATCH] ext4: fix unjournaled inode bitmap modification commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function such that the inode bitmap was being modified outside a transaction, which could lead to corruption, and was discovered when journal_checksum found a bad checksum in the journal. Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx> --- --- ialloc.c.reverted2 2012-10-27 17:31:20.351537073 -0500 +++ ialloc.c 2012-10-27 17:40:18.643553576 -0500 @@ -669,6 +669,10 @@ inode_bitmap_bh = ext4_read_inode_bitmap(sb, group); if (!inode_bitmap_bh) goto fail; + BUFFER_TRACE(inode_bitmap_bh, "get_write_access"); + err = ext4_journal_get_write_access(handle, inode_bitmap_bh); + if (err) + goto fail; repeat_in_this_group: ino = ext4_find_next_zero_bit((unsigned long *) @@ -690,6 +694,10 @@ ino++; /* the inode bitmap is zero-based */ if (!ret2) goto got; /* we grabbed the inode! */ + BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); + err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh); + if (err) + goto fail; if (ino < EXT4_INODES_PER_GROUP(sb)) goto repeat_in_this_group; } > -Eric > >> I didn't expect the last case at all, and it adequately explains why you >> are mostly seeing corrupted journal messages in your tests but I was >> not. It also explains why when I saw this for the first time I was able >> to mount the resulting corrupted filesystem read-write and corrupt it >> further before I noticed that anything was wrong. >> >> It is also clear that journal_checksum and all that relies on it is >> worse than useless right now, as Eric reported while I was testing this. >> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable >> kernels, if CONFIG_BROKEN existed anymore, which it doesn't. >> >> It's a shame journal_async_commit depends on a broken feature: it might >> be notionally unsafe but on some of my systems (without nobarrier or >> flashy caching controllers) it was associated with a noticeable speedup >> of metadata-heavy workloads -- though that was way back in 2009... >> however, "safety first" definitely applies in this case. >> > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html