Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

"Theodore Ts'o" <tytso@xxxxxxx> · Mon, 26 Aug 2024 09:32:08 -0400

On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> journal_reset. This leads to a problem later when ofcs2_mount_volume tries
> to flush the journal as part of the cleanup when aborting the mount
> operation:
> 
>   -> ofcs2_mount_volume (error; goto out_system_inodes)
>     -> ofcs2_journal_shutdown
>       -> jbd2_journal_flush
>         -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> ...
>
> I confirmed that setting the JBD2_ABORT flag in journal_reset before
> returning -EINVAL fixes the problem:
> 
>         static int journal_reset(journal_t *journal)
>                         journal_fail_superblock(journal);
>         +               journal->j_flags |= JBD2_ABORT;
>                         return -EINVAL;
> 
> You can find a proper patch file + the syzbot re-test result in [1].
> However, I'm not entirely sure whether this is the correct decision, and I
> wanted to confirm that this is an appropriate solution before sending a
> proper patch to the mailing list.

The reason why this isn't an issue with ext4 is because the normal
"right" way to do this is if jbd2_journal_load() returns an error, is
to call jbd2_journal_destroy() to release the data structures with the
journal --- and then don't try to use the journal afterwards.

The weird thing is that there are two code paths in ocfs2 that calls
jbd2_journal_load().  One is in ocfs2_replay_journal() which does what
ext4 does.  The other is ocfs2_load_journal() which does *not* do
this, and this is the one which you saw in the syzkaller reproducer.
It looks like one codepath is used to replay the ocfs2 for some other
node, and the is to load (and presumably later, replay) the journal
for the mounting node.

There are also some other things which look very confusing, such as the fact that ocfs2_journal_shutdown calls igrab:

	/* need to inc inode use count - jbd2_journal_destroy will iput. */
	if (!igrab(inode))
		BUG();

... which is *weird*.  Normaly the journal inode refcount is expected
to be bumped before calling jbd2_journal_init_inode() --- which
normally is done by an iget() function, and is needed to make sure the
journal inode isn't released out from under the jbd2 layer.  It looks
like ocfs2 uses the journal inode for other stuff so get the journal
inode without calling something like iget().  Which is OK, I guess,
but it means that there are a whole bunch of places where it has to
call !BUG_ON(igrab(journal->j_inode) before calling
jbd2_journal_destroy().  It would seem like the *right* thing to do is
to bump the refcount in ocfs2_journal_init(), and if for some reason
the igrab fails, it can just return an error early, instead of having
to resort to BUG_ON() later, and if you don't realize that you have to
do this weird igrab() before calling jbd2_journal_destroy(), you'll
end up leaking the journal inode.

Anyway, I think the right thing to do is to restructure how ocfs2
calls the jbd2 journal, but that's going to require a more careful
examination of the code paths.  Your patch is a bit of a blunt force
hack that should be harmless, but it's probably not the best way to
fix the problem, although doing it "right" would be more
complicated....

					- Ted