Re: [PATCH] fix for consistency errors after crash

Jan Kara <jack@xxxxxxx> · Mon, 12 Jul 2010 21:37:27 +0200

Hi,

> I've seen you guys had some open RH bugs on ext3, who all share in
> common the "bit already free" error.
> 
> This bug I reported can explain many different problems in ext[34].
> 
> Essentially, every time there is a kernel crash (or hard reboot)
> during delete/truncate of a large file,
> it may result in "bit already clear" error after reboot.
> 
> The problem is very simple and so is the fix.
> I proved the problem with 100% recreation chances using a small patch,
> instead of running statistical stress tests.
> All I did was to add a print and 10 seconds delay after transaction
> restart in ext3_free_branches and reboot > 5 seconds after the
> transaction restarts, so that kjournald will have time to commit the
> old transaction.
> After the reboot, I always get "bit already clear" errors, because the
> "half large truncate" transaction is not handled properly.
> 
> I did not get any response from ext4 guys so far and since this bug
> dates back to ext3,
> I was hoping you guys could take a look and put your weight on pushing
> the fix upstream.
  Thanks for a ping. Your analysis and the fix looks correct to me. Attached is
a fix of the problem for ext3 which I'll merge if noone objects. BTW: I've also
updated a comment a bit so you might want to include than in an ext4 patch as
well.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SuSE CR Labs
>From 6c91cac7b34f60bf89cb58a9fe753918628d22c4 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Mon, 12 Jul 2010 21:04:31 +0200
Subject: [PATCH] ext3: Avoid filesystem corruption after a crash under heavy delete load

It can happen that ext3_free_branches calls ext3_forget() for an indirect
block in an earlier transaction than ext3_free_blocks() for that block.
Thus if we crash before a transaction with ext3_free_blocks() is committed,
we will see indirect block pointing to already freed blocks and complain
during orphan list cleanup.

The fix is simple: Make sure ext3_forget() is called in the same transaction
as ext3_free_blocks().

This is a backport of an ext4 fix by Amir G. <amir73il@xxxxxxxxxxxxxxxxxxxxx>

Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/ext3/inode.c |   46 +++++++++++++++++++++++++---------------------
 1 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index a786db4..436e5bb 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -2270,27 +2270,6 @@ static void ext3_free_branches(handle_t *handle, struct inode *inode,
 					   depth);
 
 			/*
-			 * We've probably journalled the indirect block several
-			 * times during the truncate.  But it's no longer
-			 * needed and we now drop it from the transaction via
-			 * journal_revoke().
-			 *
-			 * That's easy if it's exclusively part of this
-			 * transaction.  But if it's part of the committing
-			 * transaction then journal_forget() will simply
-			 * brelse() it.  That means that if the underlying
-			 * block is reallocated in ext3_get_block(),
-			 * unmap_underlying_metadata() will find this block
-			 * and will try to get rid of it.  damn, damn.
-			 *
-			 * If this block has already been committed to the
-			 * journal, a revoke record will be written.  And
-			 * revoke records must be emitted *before* clearing
-			 * this block's bit in the bitmaps.
-			 */
-			ext3_forget(handle, 1, inode, bh, bh->b_blocknr);
-
-			/*
 			 * Everything below this this pointer has been
 			 * released.  Now let this top-of-subtree go.
 			 *
@@ -2313,6 +2292,31 @@ static void ext3_free_branches(handle_t *handle, struct inode *inode,
 				truncate_restart_transaction(handle, inode);
 			}
 
+			/*
+			 * We've probably journalled the indirect block several
+			 * times during the truncate.  But it's no longer
+			 * needed and we now drop it from the transaction via
+			 * journal_revoke().
+			 *
+			 * That's easy if it's exclusively part of this
+			 * transaction.  But if it's part of the committing
+			 * transaction then journal_forget() will simply
+			 * brelse() it.  That means that if the underlying
+			 * block is reallocated in ext3_get_block(),
+			 * unmap_underlying_metadata() will find this block
+			 * and will try to get rid of it.  damn, damn. Thus
+			 * we don't allow a block to be reallocated until
+			 * a transaction freeing it has fully committed.
+			 *
+			 * We also have to make sure journal replay after a
+			 * crash does not overwrite non-journaled data blocks
+			 * with old metadata when the block got reallocated for
+			 * data.  Thus we have to store a revoke record for a
+			 * block in the same transaction in which we free the
+			 * block.
+			 */
+			ext3_forget(handle, 1, inode, bh, bh->b_blocknr);
+
 			ext3_free_blocks(handle, inode, nr, 1);
 
 			if (parent_bh) {
-- 
1.6.4.2