Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

Eric Sandeen <sandeen@xxxxxxxxxx> · Sat, 27 Oct 2012 17:42:07 -0500

On 10/27/12 4:19 PM, Eric Sandeen wrote:
> On 10/27/12 1:47 PM, Nix wrote:
>> On 27 Oct 2012, Theodore Ts'o said:
>>
>>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>>>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>>>> against use of journal_async_commit, tested or not, and I'd not have
>>>> turned it on if I'd noticed that.
>>>>
>>>> (So, the combinations I'll be trying for effect on this bug are:
>>>>
>>>>  journal_async_commit (as now)
>>>>  journal_checksum
>>>>  none
>>>
>>> Can you also check and see whether the presence or absence of
>>> "nobarrier" makes a difference?
>>
>> Done. (Also checked the effect of your patches posted earlier this week:
>> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
>> I was carrying out, umount -l'ing /var as the very last thing I did
>> before /sbin/reboot -f.)
>>
>> nobarrier makes a difference that I, at least, did not expect:
>>
>> [no options]                    No corruption
>>
>> nobarrier                       No corruption
>>
>>           journal_checksum      Corruption
>>                                 Corrupted transaction, journal aborted
>>                                 
>> nobarrier,journal_checksum      Corruption
>>                                 Corrupted transaction, journal aborted
>>
>>           journal_async_commit  Corruption
>>                                 Corrupted transaction, journal aborted
>>
>> nobarrier,journal_async_commit  Corruption
>>                                 No corrupted transaction or aborted journal
> 
> That's what we needed.  Woulda been great a few days ago ;)
> 
> In my testing journal_checksum is broken, and my bisection seems to
> implicate
> 
> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
> Author: Theodore Ts'o <tytso@xxxxxxx>
> Date:   Mon Feb 6 20:12:03 2012 -0500
> 
>     ext4: fold ext4_claim_inode into ext4_new_inode
>     
> as the culprit.  I haven't had time to look into why, yet.

It looks like the inode_bitmap_bh is being modified outside a transaction:

                ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);

It needs a get_write_access / handle_dirty_metadata pair around it.

A hacky patch like this seems to work but it was done 5mins before I have
to be out the door to dinner so it probably needs cleanup or at least
scrutiny.

[PATCH] ext4: fix unjournaled inode bitmap modification

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
such that the inode bitmap was being modified outside a transaction,
which could lead to corruption, and was discovered when journal_checksum
found a bad checksum in the journal.

Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx>
---

--- ialloc.c.reverted2	2012-10-27 17:31:20.351537073 -0500
+++ ialloc.c	2012-10-27 17:40:18.643553576 -0500
@@ -669,6 +669,10 @@
 		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
 		if (!inode_bitmap_bh)
 			goto fail;
+		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+		if (err)
+			goto fail;
 
 repeat_in_this_group:
 		ino = ext4_find_next_zero_bit((unsigned long *)
@@ -690,6 +694,10 @@
 		ino++;		/* the inode bitmap is zero-based */
 		if (!ret2)
 			goto got; /* we grabbed the inode! */
+		BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+		err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+		if (err)
+			goto fail;
 		if (ino < EXT4_INODES_PER_GROUP(sb))
 			goto repeat_in_this_group;
 	}




> -Eric
> 
>> I didn't expect the last case at all, and it adequately explains why you
>> are mostly seeing corrupted journal messages in your tests but I was
>> not. It also explains why when I saw this for the first time I was able
>> to mount the resulting corrupted filesystem read-write and corrupt it
>> further before I noticed that anything was wrong.
>>
>> It is also clear that journal_checksum and all that relies on it is
>> worse than useless right now, as Eric reported while I was testing this.
>> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
>> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
>>
>> It's a shame journal_async_commit depends on a broken feature: it might
>> be notionally unsafe but on some of my systems (without nobarrier or
>> flashy caching controllers) it was associated with a noticeable speedup
>> of metadata-heavy workloads -- though that was way back in 2009...
>> however, "safety first" definitely applies in this case.
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html