On 10/28/12 9:30 PM, Theodore Ts'o wrote: > On Sat, Oct 27, 2012 at 11:23:57PM -0500, Eric Sandeen wrote: >> A little more going on here to try to properly handle error >> cases & moving to the next group; despite >> ext4_handle_release_buffer being a no-op, I've tried >> to sprinkle it in at the right places. Double checking >> on review is probably a fine idea ;) > > Sorry, I didn't see your newer version of your patch. I'm not > convinced it's worth it to try to get the calls to > ext4_handle_release_buffer() right. There are plenty of other places > where we're not calling ext4_handle_release_buffer(), and I'm not > convinced it would ever be useful to make it be something other than a > no-op. Fair enough, I went a little overboard. > In order to make it be useful, we'd have to enforce a rule > that every single get_write_access() was matched with either a > handle_dirty_metadata() or a handle_release_buffer(). That would be > tricky; worse, we'd have to keep track of a refcount on each bh, which > would cost us on the scalability front. The main benefit would be > that might be able to be able to reclaim bh's where we called > get_write_access() and then changed our mind, but that's relatively > rare, and I think it's easier to simply be more careful about calling > get_write_acceess() until we're sure we're going to need write access. > > Hence in my version of the patch, I've waited until right before the > call to ext4_lock_group() before calling get_write_access(). Note > that it's safe to call get_write_access() on a bh twice; the second > time the jbd2 layer will notice that the bh is already a part of the > transaction. Yeah, I guess that's the norm. So on the one hand you delay calling it until we're sure we need it; OTOH it's no big deal if it does get called twice :) > Also, leaving out the calls to ext4_handle_release_buffer() makes the > patch easier to understand and reason about. Fair enough. > What do you think of this version? Looks fine, tests fine. Ship it ;) -Eric > - Ted > > commit 67d725143e9e7ea458a0c1c4a6625657c3dc7ba2 > Author: Eric Sandeen <sandeen@xxxxxxxxxx> > Date: Sun Oct 28 22:24:57 2012 -0400 > > ext4: fix unjournaled inode bitmap modification > > commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed > ext4_new_inode() such that the inode bitmap was being modified > outside a transaction, which could lead to corruption, and was > discovered when journal_checksum found a bad checksum in the > journal during log replay. > > Nix ran into this when using the journal_async_commit mount > option, which enables journal checksumming. The ensuing > journal replay failures due to the bad checksums led to > filesystem corruption reported as the now infamous > "Apparent serious progressive ext4 data corruption bug" > > [ Changed by tytso to only call ext4_journal_get_write_access() only > when we're fairly certain that we're going to allocate the inode. ] > > I've tested this by mounting with journal_checksum and > running fsstress then dropping power; I've also tested by > hacking DM to create snapshots w/o first quiescing, which > allows me to test journal replay repeatedly w/o actually > power-cycling the box. Without the patch I hit a journal > checksum error every time. With this fix it survives > many iterations. > > Reported-by: Nix <nix@xxxxxxxxxxxxx> > Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx> > Signed-off-by: "Theodore Ts'o" <tytso@xxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c > index 4facdd2..3a100e7 100644 > --- a/fs/ext4/ialloc.c > +++ b/fs/ext4/ialloc.c > @@ -725,6 +725,10 @@ repeat_in_this_group: > "inode=%lu", ino + 1); > continue; > } > + BUFFER_TRACE(inode_bitmap_bh, "get_write_access"); > + err = ext4_journal_get_write_access(handle, inode_bitmap_bh); > + if (err) > + goto fail; > ext4_lock_group(sb, group); > ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data); > ext4_unlock_group(sb, group); > @@ -738,6 +742,11 @@ repeat_in_this_group: > goto out; > > got: > + BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); > + err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh); > + if (err) > + goto fail; > + > /* We may have to initialize the block bitmap if it isn't already */ > if (ext4_has_group_desc_csum(sb) && > gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { > @@ -771,11 +780,6 @@ got: > goto fail; > } > > - BUFFER_TRACE(inode_bitmap_bh, "get_write_access"); > - err = ext4_journal_get_write_access(handle, inode_bitmap_bh); > - if (err) > - goto fail; > - > BUFFER_TRACE(group_desc_bh, "get_write_access"); > err = ext4_journal_get_write_access(handle, group_desc_bh); > if (err) > @@ -823,11 +827,6 @@ got: > } > ext4_unlock_group(sb, group); > > - BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); > - err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh); > - if (err) > - goto fail; > - > BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata"); > err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh); > if (err) > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html