Re: [RFC] Bug in journal_commit_transaction?

Jan Kara <jack@xxxxxxx> · Thu, 20 Apr 2006 10:47:53 +0200

Hello,

>   sorry, I was probably to terse in my previous mail. I'll try to
> describe more details now.
> 
> > >   I think I have found a possible cause of assertion failure in
> > > journal_commit_buffer() on line 793 (b_next_transaction == NULL) (which
> > 
> > you mean journal_commit_transaction().
>   Yes.
> 
> > > was reported about two times as I remember). And I think the problem is
> > > just a few lines below in cond_resched_lock(&journal->j_list_lock);
> > >   The problem is following: we are processing t_forget list of a
> > > committing transaction. On this forget list are among other buffers also
> > > bitmaps and buffers that are freed by this transaction. Now if bitmap
> > > buffer is processed first, we switch to new bitmap (hence the buffer
> > > freed by this transaction is again available for allocation). If we
> > > reschedule on that cond_resched_lock() and in the meantime someone
> > > allocates the buffer, we later fail with that assertion failure.
> > 
> > I assume you're referring to the cond_resched_lock() at line 799.
> > 
> > I don't really follow your description here (it seems to be missing bits),
> > but that loop drops that lock in other places - why wouldn't those places
> > also be vulnerable?
>   Yes, I meant that cond_resched_lock(). But you are right that the
> problem is not in cond_resched_lock() itself. It is in the pure fact
> that we can process bitmap buffer in t_forget list before all the
> buffers transaction has freed. Explanation is below.
>   When we are freeing some blocks we do journal_get_undo_access() on the
> bitmap buffer. That copies current state of the bitmap into
> b_committed_data. When we allocate any new blocks we check that the
> block is free in both b_committed_data (old state) and in b_data (new
> state). That makes sure that we do not reallocate any freed block before
> transaction that frees it is committed. But unlocking j_list_lock when
> processing t_forget list in journal_commit_transaction() can actually
> break that assumption in some cases:
>   If we commit the transaction we write the bitmap buffer into the log
> and then file it into BJ_Forget list. We also write the freed data
> buffer (e.g. zeroed indirect block) and file it into BJ_Forget list.
> Then we start processing BJ_Forget list. First we find the bitmap
> buffer. In line 748 we free the b_committed_data stored by
> journal_get_undo_access(). From now on the allocation code in
> ext3_test_allocatable() will no longer find b_committed_data and hence
> it will assume the block is freed. *BUT* we still have the freed buffer
> in t_forget list. If it happens that we reallocate the block before
> we process the buffer in t_forget list, it will have b_next_transaction
> set and we fail at the assertion failure in line 793.
>   I hope it is clearer now.
  I've got no answer but I still think my argument was sound ;) Below
is a patch also with a verbose comment what it fixes and how. Andrew,
can you put it in -mm, please? Currently it's only basically tested
but some guys from IBM, who are able to trigger that assertion, should
give it a good beating soon...

							Bye
								Honza
-- 
Jan Kara <jack@xxxxxxx>
SuSE CR Labs


Fix possible assertion failure in journal_commit_transaction() on
jh->b_next_transaction == NULL (when we are processing BJ_Forget list and
buffer is not jbddirty). !jbddirty buffers can be placed on BJ_Forget list
for example by journal_forget() or by __dispose_buffer() - generally such
buffer means that it has been freed by this transaction. Freed buffers should
not be reallocated until the transaction has committed (that's why we have
the assertion there) but they *can* be reallocated when the transaction
has already been committed to disk and we are just processing the BJ_Forget
list (as soon as we remove b_committed_data from the bitmap bh, ext3 will
be able to reallocate buffers freed by the committing transaction). So
we have to also count with the case that the buffer has been reallocated and
b_next_transaction has been already set.
  And one more subtle point: it can happen that we manage to reallocate the
buffer and also mark it jbddirty. Then we also add the freed buffer to
the checkpoint list of the committing trasaction. But that should do no harm.

Signed-off-by: Jan Kara <jack@xxxxxxx>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16/fs/jbd/commit.c linux-2.6.16-1-realloc_freed_fix/fs/jbd/commit.c

--- linux-2.6.16/fs/jbd/commit.c	2006-01-15 00:20:12.000000000 +0100
+++ linux-2.6.16-1-realloc_freed_fix/fs/jbd/commit.c	2006-04-20 10:32:15.000000000 +0200
@@ -790,11 +790,22 @@ restart_loop:
 			jbd_unlock_bh_state(bh);
 		} else {
 			J_ASSERT_BH(bh, !buffer_dirty(bh));
-			J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
-			__journal_unfile_buffer(jh);
-			jbd_unlock_bh_state(bh);
-			journal_remove_journal_head(bh);  /* needs a brelse */
-			release_buffer_page(bh);
+			/* The buffer on BJ_Forget list and not jbddirty means
+			 * it has been freed by this transaction and hence it
+			 * could not have been reallocated until this
+			 * transaction has committed. *BUT* it could be
+			 * reallocated once we have written all the data to
+			 * disk and before we process the buffer on BJ_Forget
+			 * list. */
+			JBUFFER_TRACE(jh, "refile or unfile freed buffer");
+			__journal_refile_buffer(jh);
+			if (!jh->b_transaction) {
+				jbd_unlock_bh_state(bh);
+				 /* needs a brelse */
+				journal_remove_journal_head(bh);
+				release_buffer_page(bh);
+			} else
+				jbd_unlock_bh_state(bh);
 		}
 		cond_resched_lock(&journal->j_list_lock);
 	}
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html