On Mon, 2024-08-26 at 12:34 -0700, Joel Becker wrote: > On Mon, Aug 26, 2024 at 09:32:08AM -0400, Theodore Ts'o wrote: > > On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote: > > > Since the disk data is bogus, journal_reset fails with -EINVAL > > > ("JBD2: > > > Journal too short (blocks 2-1024)"); this leaves journal->j_head > > > == NULL. > > > However, jbd2_journal_load clears the JBD2_ABORT flag right > > > before calling > > > journal_reset. This leads to a problem later when > > > ofcs2_mount_volume tries > > > to flush the journal as part of the cleanup when aborting the > > > mount > > > operation: > > > > > > -> ofcs2_mount_volume (error; goto out_system_inodes) > > > -> ofcs2_journal_shutdown > > > -> jbd2_journal_flush > > > -> jbd2_cleanup_journal_tail (J_ASSERT fails) > > > ... > > The reason why this isn't an issue with ext4 is because the normal > > "right" way to do this is if jbd2_journal_load() returns an error, > > is > > to call jbd2_journal_destroy() to release the data structures with > > the > > journal --- and then don't try to use the journal afterwards. > > > > The weird thing is that there are two code paths in ocfs2 that > > calls > > jbd2_journal_load(). One is in ocfs2_replay_journal() which does > > what > > ext4 does. The other is ocfs2_load_journal() which does *not* do > > this, and this is the one which you saw in the syzkaller > > reproducer. > > It looks like one codepath is used to replay the ocfs2 for some > > other > > node, and the is to load (and presumably later, replay) the journal > > for the mounting node. > > You are correct, Ted, that one path is for the local journal and the > other is to recover remote journals for other nodes that may have > crashed. > > I think the big ordering issue is that we set > osb->journal->j_state=OCFS2_JOURNAL_LOADED in ocfs2_journal_init(), > before we've attempted any replay. Later in > ocfs2_journal_shutdown(), > we check this state and decide to perform cleanup. > > Instead, we should not set OCFS2_JOURNAL_LOADED until > ocfs2_journal_load() has called jbd2_journal_load(). Only then do we > actually know we have loaded a valid journal. Yeah, it's right. We should not set OCFS2_JOURNAL_LOADED in ocfs2_journal_load(). Instead, we should set other flag like OCFS2_JOURNAL_INIT, to indicate that resources have been allocated. This way, we can clean them up properly in ocfs2_journal_shutdown(). We should distinguish between these two states to ensure the correct exit procedure when an error occurs, just like this patch[1] does. [1]: https://lore.kernel.org/ocfs2-devel/20240823083150.17590-1-sunjunchao2870@xxxxxxxxx/ > > Something like: > > ``` > status = jbd2_journal_load(journal->j_journal); > if (status < 0) { > mlog(ML_ERROR, "Failed to load journal!\n"); > BUG_ON(!igrab(journal->j_inode)); > jbd2_journal_destroy(journal->j_journal); > iput(journal->j_inode) > goto done; > } > journal->j_state = OCFS2_JOURNAL_LOADED; > ``` > > With code like this, when jbd2_journal_load() fails, the future > ocfs2_journal_shutdown() will exit early, because > !OCFS2_JOURNAL_LOADED. > > I think this is the right spot; a quick audit of the paths (it has > been > a while) doesn't find any other outstanding state; the rest of > journal > startup, such as the commit thread etc, only happen after this. > > > jbd2_journal_destroy(). It would seem like the *right* thing to do > > is > > to bump the refcount in ocfs2_journal_init(), and if for some > > reason > > the igrab fails, it can just return an error early, instead of > > having > > to resort to BUG_ON() later, and if you don't realize that you have > > to > > do this weird igrab() before calling jbd2_journal_destroy(), you'll > > end up leaking the journal inode. > > There are interactions of journal inodes for nodes we don't own, and > also connections to cluster locks for our own journal (don't replay > ourselves while another node has locked it and is recovering us). So > we > do have some state to keep track of. But it's been so long that I > don't > recall if there was a specific reason we do this late via igrab(), or > if > it's just that we should have done as you describe and missed it. > You'll note that I copied the igrab/iput game in my snippet above. > > Should someone try to audit the igrab/iput thing later? Yes. But > it's > not a necessary part of this fix. > > Thanks, > Joel > Thanks, -- Julian Sun <sunjunchao2870@xxxxxxxxx>