Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound

Ben Hutchings <ben@xxxxxxxxxxxxxxx> · Mon, 18 Mar 2013 04:53:15 +0000

On Mon, 2013-03-18 at 15:24 +1100, George Barnett wrote:
> On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote:
[...]

> > I think a patch like this should fix things; I've run a stress test
> > with a hack to increment the transaction id by 1 << 24 after each
> > commit, to more quickly cause an tid wrap, and the regression tests
> > seem to be passing without complaint.
> Excellent news.  Again, thank you for your help in this regard.
> 
> 
> @Ben - could you let me know what your preferred course of action
> would be here?  As I'm sure you can understand, I do not wish to
> maintain a forked kernel from Debian upstream.  Is this something you
> would be prepared to integrate into the 3.2 BPO kernels?
[...]

We need you to verify that this fix works first.  If it does, it should
get included in the various 3.x.y stable branches and in Debian kernel
packages.

Ted might prefer you to test this against current mainline (3.9-rc3),
though you may find it easier to test against the version you already
have.  In the latter case you'll need a slightly modified version of the
patch (attached) and instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official>.

Ben.

-- 
Ben Hutchings
Never attribute to conspiracy what can adequately be explained by stupidity.
From: Theodore Ts'o <tytso@xxxxxxx>
Date: Sun, 17 Mar 2013 22:24:46 -0400
Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
 wraparound

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@xxxxxxx>
Reported-by: Ben Hutchings <ben@xxxxxxxxxxxxxxx>
Reported-by: George Barnett <gbarnett@xxxxxxxxxxxxx>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@xxxxxxxxxxxxxxx>
---
 fs/ext4/fsync.c      |  3 +--
 fs/ext4/inode.c      |  3 +--
 fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)

--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -260,8 +260,7 @@ int ext4_sync_file(struct file *file, lo
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	jbd2_log_start_commit(journal, commit_tid);
-	ret = jbd2_log_wait_commit(journal, commit_tid);
+	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier)
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
  out:
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -146,8 +146,7 @@ void ext4_evict_inode(struct inode *inod
 			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
 
-			jbd2_log_start_commit(journal, commit_tid);
-			jbd2_log_wait_commit(journal, commit_tid);
+			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -663,6 +663,37 @@ int jbd2_log_wait_commit(journal_t *jour
 }
 
 /*
+ * When this function returns the transaction corresponding to tid
+ * will be completed.  If the transaction has currently running, start
+ * committing that transaction before waiting for it to complete.  If
+ * the transaction id is stale, it is by definition already completed,
+ * so just return SUCCESS.
+ */
+int jbd2_complete_transaction(journal_t *journal, tid_t tid)
+{
+	int	need_to_wait = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+wait_commit:
+	return jbd2_log_wait_commit(journal, tid);
+}
+EXPORT_SYMBOL(jbd2_complete_transaction);
+
+/*
  * Log buffer allocation routines:
  */
 
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1165,6 +1165,7 @@ int __jbd2_log_start_commit(journal_t *j
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
 
Attachment:
signature.asc

Description: This is a digitally signed message part