PATCH for filesys corruption in ext3 with data=journal

neilb@cse.unsw.edu.au (Neil Brown) · Fri, 31 May 2002 12:33:23 +1000 (EST)

Hi,

 as I mentioned in earlier mail to ext3-users I have been getting some
corruption on an ext3 filesystem that has been serving NFS.  I am now
confident that I fully understand the problem and have a patch.

It only affects data=journal mode and I wonder if it might also be the
cause of the corruption noted by a number of people on linux-kernel.

First I will explain the problem.  Then display the patch.

When a file that contains dirty pages is truncated, those pages
must be dealt with.  They can either be flushed to disc, or they can
be invalidated.  Either way they must be non-dirty before their 
location on disc is allocated to another file.  If not, they might
be flushed out to disc much later and overwrite something that has
since been allocate the same location on disc.

The pages are dealt with by truncate_inode_pages which ultimately
calls journal_flushpage on each page which inturn calls
journal_unmap_buffer.

For data=writeback the data block will not be attached to the journal
at all and the buffer is "Zapped" (marked as clean).

For data=ordered the  data block can only be dirty if it is part of
the current transaction, or the committing transaction. (This is a
consequence of the "order"ing.)
If the current transaction, then it can be simply disposed on and
removed from the current transaction.  Again it is Zapped.
If the committing transaction, then it will have to go to disc, and it
will be flushed out before that transaction fully commits, and hence
before any data in a subsequent transaction - that may be allocated to
the same place of disc - has a chance of being written out.  So there
is no problem here.

But for data=journal the situation is a bit different.
Normally with data=journal, data blocks are written to the journal and
then left in the pagecache in a dirty state to be eventually flushed
to disc. Either by bdflush or by a journal checkpoint.

journal_unmap_buffer will find that the buffer is attached to the
journal, either via b_transaction or b_cp_transaction (i.e. an active
part of a transaction, or a buffer that must be checkpointed before
that transaction be be forgotten)
If the latter case, the buffer gets reattached to the current
transaction (if there is one) so it will be disposed of at when it
completes.  otherwise it is immedately Zapped.

If the transaction it is attached to is the running transaction, then
it is safe to just remove it from the transaction and never write it
at all.

But (and here we get to the meat of it all) if the transaction it is
attached to is the committing transaction then it is not safe to
discard it.  We must leave it alone until that transaction has
completed.  We must arrange that when the transaction completes
committing, the buffer then gets discards (and importantly, not left
lying around dirty).  But this *DOES*NOT*HAPPEN*.

The code in question is at about line 1831 of fs/jbd/transaction.c
and has the comment:
		/* If it is committing, we simply cannot touch it.  We
		 * can remove it's next_transaction pointer from the
		 * running transaction if that is set, but nothing
		 * else. */

This code really must arrange that a data=journal datablock never gets
marked as dirty again (note that it isn't actualy dirty at this point,
it is JBDDirty instead, to stop if from being written to disc
prematurey).

One option might be to simply clear the JBDDirty flag at this point.  
However this causes a number of assertions later on to fail, so it
isn't a good idea.
The approach that I have taken in the patch is to make use of the
BH_Freed flag which is currently unused but is defined in jbd.h
as
	BH_Freed,		/* 1 if buffer has been freed (truncated) */

It seems perfect for the job.  At this point in journal_unmap_buffer,
we set BH_Freed on the buffer.  Then in phase 7 of the commit, when
we are looking at the t_forget buffers and considering if
they are jdirty, and hence should be made really dirty again, we
avoid doing so if BH_Freed is set.

I am yet to actually try this patch out on a production system, partly
because I wanted to be sure my understanding was good enough to be
able to convincingly explain it in Email first, and partly because it
is late in the week, and I would rather start using it on a monday
morning when I can watch what is happening (for now, I am running
data=ordered to avoid further problems).

This patch is against 2.4.19-pre8 plus ext3 0.9.18.

I have not been able to reproduce the problem except be letting 300
students loose on a live server, which isn't the most forgiving
testing environment, so at the moment this patch is largely
theoretical.  If anyone has a test situation where that can reproduce
any sort of ext3 corruption with data=journal, I would be keen to here
how this patch affects things.

The corruption that I see that I blame on this problem is directories
begin overwritten with what looks like file contents and in one case
(which required 90 minutes downtime for a fsck) some indirect blocks
being over-written by file contents.

NeilBrown

--- ./fs/jbd/commit.c	2002/05/28 04:15:18	1.1
+++ ./fs/jbd/commit.c	2002/05/28 22:44:48
@@ -663,12 +663,13 @@
 		 * there's no point in keeping a checkpoint record for
 		 * it. */
 		bh = jh2bh(jh);
-		if (buffer_jdirty(bh)) {
+		if (buffer_jdirty(bh) && !__buffer_state(bh, Freed)) {
 			JBUFFER_TRACE(jh, "add to new checkpointing trans");
 			__journal_insert_checkpoint(jh, commit_transaction);
 			JBUFFER_TRACE(jh, "refile for checkpoint writeback");
 			__journal_refile_buffer(jh);
 		} else {
+			clear_bit(BH_Freed, &bh->b_state);
 			J_ASSERT_BH(bh, !buffer_dirty(bh));
 			J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
 			__journal_unfile_buffer(jh);
--- ./fs/jbd/transaction.c	2002/05/26 23:13:05	1.2
+++ ./fs/jbd/transaction.c	2002/05/28 09:24:45
@@ -1834,6 +1834,7 @@
 		 * running transaction if that is set, but nothing
 		 * else. */
 		JBUFFER_TRACE(jh, "on committing transaction");
+		set_bit(BH_Freed, &bh->b_state);
 		if (jh->b_next_transaction) {
 			J_ASSERT(jh->b_next_transaction ==
 					journal->j_running_transaction);