Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

Joel Becker <jlbec@xxxxxxxxxxxx> · Mon, 26 Aug 2024 12:34:53 -0700

On Mon, Aug 26, 2024 at 09:32:08AM -0400, Theodore Ts'o wrote:
> On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> > Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> > However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> > journal_reset. This leads to a problem later when ofcs2_mount_volume tries
> > to flush the journal as part of the cleanup when aborting the mount
> > operation:
> > 
> >   -> ofcs2_mount_volume (error; goto out_system_inodes)
> >     -> ofcs2_journal_shutdown
> >       -> jbd2_journal_flush
> >         -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > ...
> The reason why this isn't an issue with ext4 is because the normal
> "right" way to do this is if jbd2_journal_load() returns an error, is
> to call jbd2_journal_destroy() to release the data structures with the
> journal --- and then don't try to use the journal afterwards.
> 
> The weird thing is that there are two code paths in ocfs2 that calls
> jbd2_journal_load().  One is in ocfs2_replay_journal() which does what
> ext4 does.  The other is ocfs2_load_journal() which does *not* do
> this, and this is the one which you saw in the syzkaller reproducer.
> It looks like one codepath is used to replay the ocfs2 for some other
> node, and the is to load (and presumably later, replay) the journal
> for the mounting node.

You are correct, Ted, that one path is for the local journal and the
other is to recover remote journals for other nodes that may have
crashed.

I think the big ordering issue is that we set
osb->journal->j_state=OCFS2_JOURNAL_LOADED in ocfs2_journal_init(),
before we've attempted any replay.  Later in ocfs2_journal_shutdown(),
we check this state and decide to perform cleanup.

Instead, we should not set OCFS2_JOURNAL_LOADED until
ocfs2_journal_load() has called jbd2_journal_load().  Only then do we
actually know we have loaded a valid journal.

Something like:

```
	status = jbd2_journal_load(journal->j_journal);
	if (status < 0) {
		mlog(ML_ERROR, "Failed to load journal!\n");
		BUG_ON(!igrab(journal->j_inode));
		jbd2_journal_destroy(journal->j_journal);
		iput(journal->j_inode)
		goto done;
	}
	journal->j_state = OCFS2_JOURNAL_LOADED;
```

With code like this, when jbd2_journal_load() fails, the future
ocfs2_journal_shutdown() will exit early, because !OCFS2_JOURNAL_LOADED.

I think this is the right spot; a quick audit of the paths (it has been
a while) doesn't find any other outstanding state; the rest of journal
startup, such as the commit thread etc, only happen after this.

> jbd2_journal_destroy().  It would seem like the *right* thing to do is
> to bump the refcount in ocfs2_journal_init(), and if for some reason
> the igrab fails, it can just return an error early, instead of having
> to resort to BUG_ON() later, and if you don't realize that you have to
> do this weird igrab() before calling jbd2_journal_destroy(), you'll
> end up leaking the journal inode.

There are interactions of journal inodes for nodes we don't own, and
also connections to cluster locks for our own journal (don't replay
ourselves while another node has locked it and is recovering us).  So we
do have some state to keep track of.  But it's been so long that I don't
recall if there was a specific reason we do this late via igrab(), or if
it's just that we should have done as you describe and missed it.
You'll note that I copied the igrab/iput game in my snippet above.

Should someone try to audit the igrab/iput thing later?  Yes.  But it's
not a necessary part of this fix.

Thanks,
Joel

-- 

"War doesn't determine who's right; war determines who's left."

			http://www.jlbec.org/
			jlbec@xxxxxxxxxxxx