On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote: > On 2025/3/7 18:27, Ojaswin Mujoo wrote: > > On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote: > >> On 2025/3/7 16:13, Ojaswin Mujoo wrote: > >>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote: > >>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote: > >>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote: > >>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked > >>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running > >>>>>> stress tests, it was observed that in case of some error handling paths, it is > >>>>>> possible for update_super_work to start a transaction after the journal is > >>>>>> destroyed eg: > >>>>>> > >>>>>> (umount) > >>>>>> ext4_kill_sb > >>>>>> kill_block_super > >>>>>> generic_shutdown_super > >>>>>> sync_filesystem /* commits all txns */ > >>>>>> evict_inodes > >>>>>> /* might start a new txn */ > >>>>>> ext4_put_super > >>>>>> flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */ > >>>>>> jbd2_journal_destroy > >>>>>> journal_kill_thread > >>>>>> journal->j_flags |= JBD2_UNMOUNT; > >>>>>> jbd2_journal_commit_transaction > >>>>>> jbd2_journal_get_descriptor_buffer > >>>>>> jbd2_journal_bmap > >>>>>> ext4_journal_bmap > >>>>>> ext4_map_blocks > >>>>>> ... > >>>>>> ext4_inode_error > >>>>>> ext4_handle_error > >>>>>> schedule_work(&sbi->s_sb_upd_work) > >>>>>> > >>>>>> /* work queue kicks in */ > >>>>>> update_super_work > >>>>>> jbd2_journal_start > >>>>>> start_this_handle > >>>>>> BUG_ON(journal->j_flags & > >>>>>> JBD2_UNMOUNT) > >>>>>> > >>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is > >>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not > >>>>>> set. Otherwise, just fallback to an un-journaled commit. > >>>>>> > >>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done > >>>>>> during ext4_put_super() (except a running transaction that will get commited > >>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb > >>>>>> outside the journal as it won't race with a journaled update (refer > >>>>>> 2d01ddc86606). > >>>>>> > >>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only > >>>>>> called from mballoc and AFAICT it would be always valid to schedule work here. > >>>>>> > >>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") > >>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx> > >>>>>> Suggested-by: Jan Kara <jack@xxxxxxx> > >>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> > >>>>>> --- > >>>>>> fs/ext4/ext4.h | 2 ++ > >>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++ > >>>>>> fs/ext4/super.c | 4 +++- > >>>>>> 3 files changed, 13 insertions(+), 1 deletion(-) > >>>>>> > >>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > >>>>>> index 2b7d781bfcad..d48e93bd5690 100644 > >>>>>> --- a/fs/ext4/ext4.h > >>>>>> +++ b/fs/ext4/ext4.h > >>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info { > >>>>>> */ > >>>>>> struct work_struct s_sb_upd_work; > >>>>>> > >>>>>> + bool s_journal_destorying; > >>>>>> + > >>>>>> /* Atomic write unit values in bytes */ > >>>>>> unsigned int s_awu_min; > >>>>>> unsigned int s_awu_max; > >>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h > >>>>>> index 9b3c9df02a39..6bd3ca84410d 100644 > >>>>>> --- a/fs/ext4/ext4_jbd2.h > >>>>>> +++ b/fs/ext4/ext4_jbd2.h > >>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour > >>>>>> { > >>>>>> int err = 0; > >>>>>> > >>>>>> + /* > >>>>>> + * At this point all pending FS updates should be done except a possible > >>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It > >>>>>> + * is now safe for any new errors to directly commit superblock rather > >>>>>> + * than going via journal. > >>>>>> + */ > >>>>>> + sbi->s_journal_destorying = true; > >>>>>> + > >>>>> > >>>>> Hi, Ojaswin! > >>>>> > >>>>> I'm afraid you still need to flush the superblock update work here, > >>>>> otherwise I guess the race condition you mentioned in v1 could still > >>>>> occur. > >>>>> > >>>>> ext4_put_super() > >>>>> flush_work(&sbi->s_sb_upd_work) > >>>>> > >>>>> **kjournald2** > >>>>> jbd2_journal_commit_transaction() > >>>>> ... > >>>>> ext4_inode_error() > >>>>> /* JBD2_UNMOUNT not set */ > >>>>> schedule_work(s_sb_upd_work) > >>>>> > >>>>> **workqueue** > >>>>> update_super_work > >>>>> /* s_journal_destorying is not set */ > >>>>> if (journal && !s_journal_destorying) > >>>>> > >>>>> ext4_journal_destroy() > >>>>> /* set s_journal_destorying */ > >>>>> sbi->s_journal_destorying = true; > >>>>> jbd2_journal_destroy() > >>>>> journal->j_flags |= JBD2_UNMOUNT; > >>>>> > >>>>> jbd2_journal_start() > >>>>> start_this_handle() > >>>>> BUG_ON(JBD2_UNMOUNT) > >>>>> > >>>>> Thanks, > >>>>> Yi. > >>>> Hi Yi, > >>>> > >>>> Yes you are right, somehow missed this edge case :( > >>>> > >>>> Alright then, we have to move out sbi->s_journal_destroying outside the > >>>> helper. Just wondering if I should still let it be in > >>>> ext4_journal_destroy and just add an extra s_journal_destroying = false > >>>> before schedule_work(s_sb_upd_work), because it makes sense. > >>>> > >>>> Okay let me give it some thought but thanks for pointing this out! > >>>> > >>>> Regards, > >>>> ojaswin > >>> > >>> Okay so thinking about it a bit more, I see you also suggested to flush > >>> the work after marking sbi->s_journal_destroying. But will that solve > >>> it? > >>> > >>> ext4_put_super() > >>> flush_work(&sbi->s_sb_upd_work) > >>> > >>> **kjournald2** > >>> jbd2_journal_commit_transaction() > >>> ... > >>> ext4_inode_error() > >>> /* JBD2_UNMOUNT not set */ > >>> schedule_work(s_sb_upd_work) > >>> > >>> **workqueue** > >>> update_super_work > >>> /* s_journal_destorying is not set */ > >>> if (journal && !s_journal_destorying) > >>> > >>> ext4_journal_destroy() > >>> /* set s_journal_destorying */ > >>> sbi->s_journal_destorying = true; > >>> flush_work(&sbi->s_sb_upd_work) > >>> schedule_work() > >> ^^^^^^^^^^^^^^^ > >> where does this come from? > >> > >> After this flush_work, we can guarantee that the running s_sb_upd_work > >> finishes before we set JBD2_UNMOUNT. Additionally, the journal will > >> not commit transaction or call schedule_work() again because it has > >> been aborted due to the previous error. Am I missing something? > >> > >> Thanks, > >> Yi. > > > > Hmm, so I am thinking of a corner case in ext4_handle_error() where > > > > if(journal && !is_journal_destroying) > > > > is computed but schedule_work() not called yet, which is possible cause > > the cmp followed by jump is not atomic in nature. If the schedule_work > > is only called after we have done the flush then we end up with this: > > > > if (journal && !s_journal_destorying) > > ext4_journal_destroy() > > /* set s_journal_destorying */ > > sbi->s_journal_destorying = true; > > flush_work(&sbi->s_sb_upd_work) > > schedule_work() > > > > Which is possible IMO, although the window is tiny. > > Yeah, right! > Sorry for misread the location where you add the "!s_journal_destorying" > check, the graph I provided was in update_super_work(), which was wrong. Oh right, I also misread your trace but yes as discussed, even sbi->s_journal_destorying = true; flush_work() jbd2_journal_destroy() doesn't work. > The right one should be: > > ext4_put_super() > flush_work(&sbi->s_sb_upd_work) > > **kjournald2** > jbd2_journal_commit_transaction() > ... > ext4_inode_error() > /* s_journal_destorying is not set */ > if (journal && !s_journal_destorying) > (schedule_work(s_sb_upd_work)) //can be here > > ext4_journal_destroy() > /* set s_journal_destorying */ > sbi->s_journal_destorying = true; > jbd2_journal_destroy() > journal->j_flags |= JBD2_UNMOUNT; > > (schedule_work(s_sb_upd_work)) //also can be here > > **workqueue** > update_super_work() > journal = sbi->s_journal //get journal > kfree(journal) > jbd2_journal_start(journal) //journal UAF > start_this_handle() > BUG_ON(JBD2_UNMOUNT) //bugon here > > > So there are two problems here, the first one is the 'journal' UAF, > the second one is triggering JBD2_UNMOUNT flag BUGON. Indeed, there's a possible UAF here as well. > > >>> > >>> As for the fix, how about we do something like this: > >>> > >>> ext4_put_super() > >>> > >>> flush_work(&sbi->s_sb_upd_work) > >>> destroy_workqueue(sbi->rsv_conversion_wq); > >>> > >>> ext4_journal_destroy() > >>> /* set s_journal_destorying */ > >>> sbi->s_journal_destorying = true; > >>> > >>> /* trigger a commit and wait for it to complete */ > >>> > >>> flush_work(&sbi->s_sb_upd_work) > >>> > >>> jbd2_journal_destroy() > >>> journal->j_flags |= JBD2_UNMOUNT; > >>> > >>> jbd2_journal_start() > >>> start_this_handle() > >>> BUG_ON(JBD2_UNMOUNT) > >>> > >>> Still giving this codepath some thought but seems like this might just > >>> be enough to fix the race. Thoughts on this? > >>> > > I think this solution should work, the forced commit and flush_work() > should ensure that the last transaction is committed and that the > potential work is done. > > Besides, the s_journal_destorying flag is set and check concurrently > now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what > about adding a new flag into sbi->s_mount_state instead of adding > new s_journal_destorying? Right, that makes sence. I will incorporate these changes in the next revision. Thanks for the review, ojaswin > > Thanks, > Yi. > > >>> > >>>>> > >>>>>> err = jbd2_journal_destroy(journal); > >>>>>> sbi->s_journal = NULL; > >>>>>> > >>>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c > >>>>>> index 8ad664d47806..31552cf0519a 100644 > >>>>>> --- a/fs/ext4/super.c > >>>>>> +++ b/fs/ext4/super.c > >>>>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error, > >>>>>> * constraints, it may not be safe to do it right here so we > >>>>>> * defer superblock flushing to a workqueue. > >>>>>> */ > >>>>>> - if (continue_fs && journal) > >>>>>> + if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying) > >>>>>> schedule_work(&EXT4_SB(sb)->s_sb_upd_work); > >>>>>> else > >>>>>> ext4_commit_super(sb); > >>>>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb) > >>>>>> spin_lock_init(&sbi->s_error_lock); > >>>>>> INIT_WORK(&sbi->s_sb_upd_work, update_super_work); > >>>>>> > >>>>>> + sbi->s_journal_destorying = false; > >>>>>> + > >>>>>> err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed); > >>>>>> if (err) > >>>>>> goto failed_mount3; > >>>>> > >>> > >> > > >