Re: 2.6.32 ext3 assertion j_running_transaction != NULL fails in commit.c

"Ted Ts'o" <tytso@xxxxxxx> · Mon, 25 Apr 2011 19:14:54 -0400

On Thu, Apr 21, 2011 at 09:17:57AM -0500, Martin_Zielinski@xxxxxxxxxx wrote:
> 
> I posted this BUG already on the ext3-users list without response.
> After making some new observations I hope, that someone here can
> tell me these make sense. Kernel output of the BUG is at the end of
> the mail.

Hi Martin,

Thanks for your observations.  I don't necessarily always follow mail
sent to ext3-users, but fortunately I saw this note sent to the LKML
list.  

> Here's some debug output that I put into the code:
> kernel: (fs/ext3/fsync.c, 77): ext3_sync_file: ext3_sync_file datasync=1 d_tid=27807 tid=27846
> kernel: (fs/jbd/journal.c, 467): log_start_commit: log start commit called with commit request=27845, tid=27807 running transaction=ffff8800266913c0 27846
> 
> So the "really-commited" transaction id was advancing while this
> datasync_tid stayed the same and journal.c - log_start_commit() was
> called without waking the commit process.
> 
> I wondered what happens if the current journal tid is overflowing
> (32bit unsigned integer). By forcing the tid in get_transaction to
> jump close to UINT_MAX, I could reproduce the BUG.

A simple overflow shouldn't cause the problem, because of how
tid_geq() is coded.  However, if there have been 2**31 commits since
the file being fdatasync'ed was opened, it's possible to trigger this.
That's a **lot** of commits, so I'm not sure I'm completely happy with
this theory.  Nevertheless, I believe this set of patches (one for
ext4, and one for ext3) should prevent the crash from happening.
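
To make the wraparound argument concrete, here is a minimal userspace
sketch of the signed-difference comparison that tid_geq() relies on.
The tid_t typedef is a stand-in for the kernel type, and
wants_new_commit() and tid_window_checks() are hypothetical helpers
added for illustration only; they are not the jbd code and not part of
the patches.  The sketch assumes the usual two's-complement conversion
from uint32_t to int32_t.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tid_t;	/* userspace stand-in for the kernel's tid_t */

/*
 * Same idea as the kernel's tid_geq(): compare two transaction IDs by
 * the sign of their 32-bit difference, so a plain wrap past UINT32_MAX
 * is harmless as long as the IDs are fewer than 2**31 apart.
 */
static int tid_geq(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference >= 0;
}

/*
 * Hypothetical helper: models only the "is a recent enough commit
 * already requested?" test, not the wakeup of the commit thread.
 */
static int wants_new_commit(tid_t commit_request, tid_t target)
{
	return !tid_geq(commit_request, target);
}

/* Illustrative checks; the values are taken from the debug output above. */
void tid_window_checks(void)
{
	tid_t stale = 27807;	/* d_tid from the debug output */

	/* A tid that has just wrapped still compares as newer. */
	assert(tid_geq(5u, UINT32_MAX - 5u));

	/* commit request=27845 already covers tid=27807: nothing to wake. */
	assert(!wants_new_commit(27845u, stale));

	/* Just inside the 2**31 window the ordering still holds... */
	assert(tid_geq(stale + 0x7fffffffu, stale));

	/* ...but 2**31 commits later the stale tid looks "newer". */
	assert(!tid_geq(stale + 0x80000000u, stale));
	assert(wants_new_commit(stale + 0x80000000u, stale));
}
```

The last two checks show the case I have in mind: once the journal has
moved 2**31 commits past a stale datasync tid, the comparison flips and
a commit gets requested for a transaction that is long gone.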

                                                - Ted