On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal
>>>>>> marked with JBD2_UNMOUNT, since this should never happen. However, while
>>>>>> running LTP stress tests, it was observed that in some error handling
>>>>>> paths it is possible for update_super_work to start a transaction after
>>>>>> the journal is destroyed, eg:
>>>>>>
>>>>>> (umount)
>>>>>> ext4_kill_sb
>>>>>>   kill_block_super
>>>>>>     generic_shutdown_super
>>>>>>       sync_filesystem /* commits all txns */
>>>>>>       evict_inodes
>>>>>>         /* might start a new txn */
>>>>>>       ext4_put_super
>>>>>>         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>         jbd2_journal_destroy
>>>>>>           journal_kill_thread
>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>           jbd2_journal_commit_transaction
>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>               jbd2_journal_bmap
>>>>>>                 ext4_journal_bmap
>>>>>>                   ext4_map_blocks
>>>>>>                     ...
>>>>>>                     ext4_inode_error
>>>>>>                       ext4_handle_error
>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>                         /* work queue kicks in */
>>>>>>                         update_super_work
>>>>>>                           jbd2_journal_start
>>>>>>                             start_this_handle
>>>>>>                               BUG_ON(journal->j_flags &
>>>>>>                                      JBD2_UNMOUNT)
>>>>>>
>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate the
>>>>>> journal is being destroyed, and only do a journaled (and deferred) update
>>>>>> of the sb if this flag is not set. Otherwise, just fall back to an
>>>>>> un-journaled commit.
>>>>>>
>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are
>>>>>> done during ext4_put_super() (except a running transaction that will get
>>>>>> committed during jbd2_journal_destroy()). After this point, it is safe to
>>>>>> commit the sb outside the journal as it won't race with a journaled
>>>>>> update (refer 2d01ddc86606).
>>>>>>
>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is
>>>>>> only called from mballoc and AFAICT it would always be valid to schedule
>>>>>> work here.
>>>>>>
>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>>>>>> ---
>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>> --- a/fs/ext4/ext4.h
>>>>>> +++ b/fs/ext4/ext4.h
>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>  	 */
>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>
>>>>>> +	bool s_journal_destorying;
>>>>>> +
>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>  	unsigned int s_awu_min;
>>>>>>  	unsigned int s_awu_max;
>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>  {
>>>>>>  	int err = 0;
>>>>>>
>>>>>> +	/*
>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>> +	 * than going via journal.
>>>>>> +	 */
>>>>>> +	sbi->s_journal_destorying = true;
>>>>>> +
>>>>>
>>>>> Hi, Ojaswin!
>>>>>
>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>> occur.
>>>>>
>>>>> ext4_put_super()
>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>
>>>>>                 **kjournald2**
>>>>>                 jbd2_journal_commit_transaction()
>>>>>                 ...
>>>>>                 ext4_inode_error()
>>>>>                   /* JBD2_UNMOUNT not set */
>>>>>                   schedule_work(s_sb_upd_work)
>>>>>
>>>>>                                 **workqueue**
>>>>>                                 update_super_work
>>>>>                                 /* s_journal_destorying is not set */
>>>>>                                 if (journal && !s_journal_destorying)
>>>>>
>>>>>   ext4_journal_destroy()
>>>>>     /* set s_journal_destorying */
>>>>>     sbi->s_journal_destorying = true;
>>>>>     jbd2_journal_destroy()
>>>>>       journal->j_flags |= JBD2_UNMOUNT;
>>>>>
>>>>>                                 jbd2_journal_start()
>>>>>                                   start_this_handle()
>>>>>                                     BUG_ON(JBD2_UNMOUNT)
>>>>>
>>>>> Thanks,
>>>>> Yi.
>>>> Hi Yi,
>>>>
>>>> Yes, you are right, somehow I missed this edge case :(
>>>>
>>>> Alright then, we have to move sbi->s_journal_destroying out of the
>>>> helper. Just wondering if I should still let it be in
>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>
>>>> Okay, let me give it some thought, but thanks for pointing this out!
>>>>
>>>> Regards,
>>>> ojaswin
>>>
>>> Okay, so thinking about it a bit more, I see you also suggested to flush
>>> the work after marking sbi->s_journal_destroying. But will that solve
>>> it?
>>>
>>> ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>>                 **kjournald2**
>>>                 jbd2_journal_commit_transaction()
>>>                 ...
>>>                 ext4_inode_error()
>>>                   /* JBD2_UNMOUNT not set */
>>>                   schedule_work(s_sb_upd_work)
>>>
>>>                                 **workqueue**
>>>                                 update_super_work
>>>                                 /* s_journal_destorying is not set */
>>>                                 if (journal && !s_journal_destorying)
>>>
>>>   ext4_journal_destroy()
>>>     /* set s_journal_destorying */
>>>     sbi->s_journal_destorying = true;
>>>     flush_work(&sbi->s_sb_upd_work)
>>>
>>>                                 schedule_work()
>>                                 ^^^^^^^^^^^^^^^
>> where does this come from?
>>
>> After this flush_work, we can guarantee that the running s_sb_upd_work
>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>> not commit a transaction or call schedule_work() again because it has
>> been aborted due to the previous error. Am I missing something?
>>
>> Thanks,
>> Yi.
>
> Hmm, so I am thinking of a corner case in ext4_handle_error() where
>
>   if (journal && !is_journal_destroying)
>
> is computed but schedule_work() is not yet called, which is possible
> because the compare followed by the jump is not atomic in nature. If
> the schedule_work is only called after we have done the flush, then we
> end up with this:
>
>   if (journal && !s_journal_destorying)
>
>   ext4_journal_destroy()
>     /* set s_journal_destorying */
>     sbi->s_journal_destorying = true;
>     flush_work(&sbi->s_sb_upd_work)
>
>   schedule_work()
>
> Which is possible IMO, although the window is tiny.

Yeah, right! Sorry, I misread the location where you added the
"!s_journal_destorying" check; the graph I provided was in
update_super_work(), which was wrong. The right one should be:

ext4_put_super()
  flush_work(&sbi->s_sb_upd_work)

                **kjournald2**
                jbd2_journal_commit_transaction()
                ...
                ext4_inode_error()
                  /* s_journal_destorying is not set */
                  if (journal && !s_journal_destorying)
                  (schedule_work(s_sb_upd_work))   // can be here

  ext4_journal_destroy()
    /* set s_journal_destorying */
    sbi->s_journal_destorying = true;
    jbd2_journal_destroy()
      journal->j_flags |= JBD2_UNMOUNT;

                  (schedule_work(s_sb_upd_work))   // also can be here

                                **workqueue**
                                update_super_work()
                                journal = sbi->s_journal // get journal

      kfree(journal)

                                jbd2_journal_start(journal) // journal UAF
                                  start_this_handle()
                                    BUG_ON(JBD2_UNMOUNT)    // BUG_ON here

So there are two problems here: the first one is the 'journal' UAF, and
the second one is triggering the JBD2_UNMOUNT flag BUG_ON.

>>>
>>> As for the fix, how about we do something like this:
>>>
>>> ext4_put_super()
>>>
>>>   flush_work(&sbi->s_sb_upd_work)
>>>   destroy_workqueue(sbi->rsv_conversion_wq);
>>>
>>>   ext4_journal_destroy()
>>>     /* set s_journal_destorying */
>>>     sbi->s_journal_destorying = true;
>>>
>>>     /* trigger a commit and wait for it to complete */
>>>
>>>     flush_work(&sbi->s_sb_upd_work)
>>>
>>>     jbd2_journal_destroy()
>>>       journal->j_flags |= JBD2_UNMOUNT;
>>>
>>> jbd2_journal_start()
>>>   start_this_handle()
>>>     BUG_ON(JBD2_UNMOUNT)
>>>
>>> Still giving this codepath some thought but seems like this might just
>>> be enough to fix the race. Thoughts on this?

I think this solution should work; the forced commit and flush_work()
should ensure that the last transaction is committed and that the
potential work is done. Besides, the s_journal_destorying flag is set
and checked concurrently now, so we need WRITE_ONCE() and READ_ONCE()
for it.

Also, what about adding a new flag into sbi->s_mount_state instead of
adding a new s_journal_destorying?

Thanks,
Yi.
>>>
>>>>>
>>>>>>  	err = jbd2_journal_destroy(journal);
>>>>>>  	sbi->s_journal = NULL;
>>>>>>
>>>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>>>> index 8ad664d47806..31552cf0519a 100644
>>>>>> --- a/fs/ext4/super.c
>>>>>> +++ b/fs/ext4/super.c
>>>>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>>>>>>  	 * constraints, it may not be safe to do it right here so we
>>>>>>  	 * defer superblock flushing to a workqueue.
>>>>>>  	 */
>>>>>> -	if (continue_fs && journal)
>>>>>> +	if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>>>>>>  		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>>>>>>  	else
>>>>>>  		ext4_commit_super(sb);
>>>>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>>>>>>  	spin_lock_init(&sbi->s_error_lock);
>>>>>>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>>>>>>
>>>>>> +	sbi->s_journal_destorying = false;
>>>>>> +
>>>>>>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>>>>>>  	if (err)
>>>>>>  		goto failed_mount3;
>>>>>
>>>
>>
>