On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> >> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a
> >>>>>>>> journal marked with JBD2_UNMOUNT, since this should never happen.
> >>>>>>>> However, while running ltp stress tests, it was observed that in
> >>>>>>>> case of some error handling paths, it is possible for
> >>>>>>>> update_super_work to start a transaction after the journal is
> >>>>>>>> destroyed, e.g.:
> >>>>>>>>
> >>>>>>>> (umount)
> >>>>>>>> ext4_kill_sb
> >>>>>>>>   kill_block_super
> >>>>>>>>     generic_shutdown_super
> >>>>>>>>       sync_filesystem /* commits all txns */
> >>>>>>>>       evict_inodes
> >>>>>>>>         /* might start a new txn */
> >>>>>>>>       ext4_put_super
> >>>>>>>>         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >>>>>>>>         jbd2_journal_destroy
> >>>>>>>>           journal_kill_thread
> >>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
> >>>>>>>>           jbd2_journal_commit_transaction
> >>>>>>>>             jbd2_journal_get_descriptor_buffer
> >>>>>>>>               jbd2_journal_bmap
> >>>>>>>>                 ext4_journal_bmap
> >>>>>>>>                   ext4_map_blocks
> >>>>>>>>                     ...
> >>>>>>>>                     ext4_inode_error
> >>>>>>>>                       ext4_handle_error
> >>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
> >>>>>>>>
> >>>>>>>> /* work queue kicks in */
> >>>>>>>> update_super_work
> >>>>>>>>   jbd2_journal_start
> >>>>>>>>     start_this_handle
> >>>>>>>>       BUG_ON(journal->j_flags & JBD2_UNMOUNT)
> >>>>>>>>
> >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate
> >>>>>>>> the journal is being destroyed, and only do a journaled (and
> >>>>>>>> deferred) update of the sb if this flag is not set. Otherwise,
> >>>>>>>> just fall back to an un-journaled commit.
> >>>>>>>>
> >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS
> >>>>>>>> updates are done during ext4_put_super() (except a running
> >>>>>>>> transaction that will get committed during
> >>>>>>>> jbd2_journal_destroy()). After this point, it is safe to commit
> >>>>>>>> the sb outside the journal as it won't race with a journaled
> >>>>>>>> update (refer 2d01ddc86606).
> >>>>>>>>
> >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error
> >>>>>>>> since it is only called from mballoc and AFAICT it would always
> >>>>>>>> be valid to schedule work here.
> >>>>>>>>
> >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
> >>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
> >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>>  fs/ext4/ext4.h      | 2 ++
> >>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >>>>>>>>  fs/ext4/super.c     | 4 +++-
> >>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
> >>>>>>>> --- a/fs/ext4/ext4.h
> >>>>>>>> +++ b/fs/ext4/ext4.h
> >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >>>>>>>>  	 */
> >>>>>>>>  	struct work_struct s_sb_upd_work;
> >>>>>>>>
> >>>>>>>> +	bool s_journal_destroying;
> >>>>>>>> +
> >>>>>>>>  	/* Atomic write unit values in bytes */
> >>>>>>>>  	unsigned int s_awu_min;
> >>>>>>>>  	unsigned int s_awu_max;
> >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
> >>>>>>>> --- a/fs/ext4/ext4_jbd2.h
> >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
> >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *journal)
> >>>>>>>>  {
> >>>>>>>>  	int err = 0;
> >>>>>>>>
> >>>>>>>> +	/*
> >>>>>>>> +	 * At this point all pending FS updates should be done except a possible
> >>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >>>>>>>> +	 * is now safe for any new errors to directly commit the superblock
> >>>>>>>> +	 * rather than going via the journal.
> >>>>>>>> +	 */
> >>>>>>>> +	sbi->s_journal_destroying = true;
> >>>>>>>> +
> >>>>>>>
> >>>>>>> Hi, Ojaswin!
> >>>>>>>
> >>>>>>> I'm afraid you still need to flush the superblock update work here,
> >>>>>>> otherwise I guess the race condition you mentioned in v1 could still
> >>>>>>> occur.
> >>>>>>>
> >>>>>>> ext4_put_super()
> >>>>>>>   flush_work(&sbi->s_sb_upd_work)
> >>>>>>>
> >>>>>>> **kjournald2**
> >>>>>>> jbd2_journal_commit_transaction()
> >>>>>>> ...
> >>>>>>> ext4_inode_error()
> >>>>>>>   /* JBD2_UNMOUNT not set */
> >>>>>>>   schedule_work(s_sb_upd_work)
> >>>>>>>
> >>>>>>> **workqueue**
> >>>>>>> update_super_work
> >>>>>>>   /* s_journal_destroying is not set */
> >>>>>>>   if (journal && !s_journal_destroying)
> >>>>>>>
> >>>>>>> ext4_journal_destroy()
> >>>>>>>   /* set s_journal_destroying */
> >>>>>>>   sbi->s_journal_destroying = true;
> >>>>>>>   jbd2_journal_destroy()
> >>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
> >>>>>>>
> >>>>>>> jbd2_journal_start()
> >>>>>>>   start_this_handle()
> >>>>>>>     BUG_ON(JBD2_UNMOUNT)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Yi.
> >>>>>>
> >>>>>> Hi Yi,
> >>>>>>
> >>>>>> Yes, you are right, I somehow missed this edge case :(
> >>>>>>
> >>>>>> Alright then, we have to move sbi->s_journal_destroying outside the
> >>>>>> helper. Just wondering if I should still let it be in
> >>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
> >>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
> >>>>>>
> >>>>>> Okay, let me give it some thought, but thanks for pointing this out!
> >>>>>>
> >>>>>> Regards,
> >>>>>> ojaswin
> >>>>>
> >>>>> Okay, so thinking about it a bit more, I see you also suggested to flush
> >>>>> the work after marking sbi->s_journal_destroying. But will that solve
> >>>>> it?
> >>>>>
> >>>>> ext4_put_super()
> >>>>>   flush_work(&sbi->s_sb_upd_work)
> >>>>>
> >>>>> **kjournald2**
> >>>>> jbd2_journal_commit_transaction()
> >>>>> ...
> >>>>> ext4_inode_error()
> >>>>>   /* JBD2_UNMOUNT not set */
> >>>>>   schedule_work(s_sb_upd_work)
> >>>>>
> >>>>> **workqueue**
> >>>>> update_super_work
> >>>>>   /* s_journal_destroying is not set */
> >>>>>   if (journal && !s_journal_destroying)
> >>>>>
> >>>>> ext4_journal_destroy()
> >>>>>   /* set s_journal_destroying */
> >>>>>   sbi->s_journal_destroying = true;
> >>>>>   flush_work(&sbi->s_sb_upd_work)
> >>>>>
> >>>>>   schedule_work()
> >>>>    ^^^^^^^^^^^^^^^
> >>>> where does this come from?
> >>>>
> >>>> After this flush_work, we can guarantee that the running s_sb_upd_work
> >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> >>>> not commit a transaction or call schedule_work() again because it has
> >>>> been aborted due to the previous error. Am I missing something?
> >>>>
> >>>> Thanks,
> >>>> Yi.
> >>>
> >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
> >>>
> >>>   if (journal && !s_journal_destroying)
> >>>
> >>> is computed but schedule_work() is not called yet, which is possible
> >>> because the cmp followed by the jump is not atomic in nature. If the
> >>> schedule_work is only called after we have done the flush, then we end
> >>> up with this:
> >>>
> >>>   if (journal && !s_journal_destroying)
> >>>
> >>>   ext4_journal_destroy()
> >>>     /* set s_journal_destroying */
> >>>     sbi->s_journal_destroying = true;
> >>>     flush_work(&sbi->s_sb_upd_work)
> >>>
> >>>   schedule_work()
> >>>
> >>> Which is possible IMO, although the window is tiny.
> >>
> >> Yeah, right!
> >> Sorry, I misread the location where you added the "!s_journal_destroying"
> >> check; the graph I provided was in update_super_work(), which was wrong.
> >
> > Oh right, I also misread your trace, but yes, as discussed, even
> >
> >   sbi->s_journal_destroying = true;
> >   flush_work()
> >   jbd2_journal_destroy()
> >
> > doesn't work.
> >
> >> The right one should be:
> >>
> >> ext4_put_super()
> >>   flush_work(&sbi->s_sb_upd_work)
> >>
> >> **kjournald2**
> >> jbd2_journal_commit_transaction()
> >> ...
> >> ext4_inode_error()
> >>   /* s_journal_destroying is not set */
> >>   if (journal && !s_journal_destroying)
> >>   (schedule_work(s_sb_upd_work)) // can be here
> >>
> >> ext4_journal_destroy()
> >>   /* set s_journal_destroying */
> >>   sbi->s_journal_destroying = true;
> >>   jbd2_journal_destroy()
> >>     journal->j_flags |= JBD2_UNMOUNT;
> >>
> >> (schedule_work(s_sb_upd_work)) // also can be here
> >>
> >> **workqueue**
> >> update_super_work()
> >>   journal = sbi->s_journal // get journal
> >>
> >> kfree(journal)
> >>
> >>   jbd2_journal_start(journal) // journal UAF
> >>     start_this_handle()
> >>       BUG_ON(JBD2_UNMOUNT) // BUG_ON here
> >>
> >> So there are two problems here: the first one is the 'journal' UAF,
> >> and the second one is triggering the JBD2_UNMOUNT BUG_ON.
> >
> > Indeed, there's a possible UAF here as well.
> >
> >>
> >>>>>
> >>>>> As for the fix, how about we do something like this:
> >>>>>
> >>>>> ext4_put_super()
> >>>>>
> >>>>>   flush_work(&sbi->s_sb_upd_work)
> >>>>>   destroy_workqueue(sbi->rsv_conversion_wq);
> >>>>>
> >>>>>   ext4_journal_destroy()
> >>>>>     /* set s_journal_destroying */
> >>>>>     sbi->s_journal_destroying = true;
> >>>>>
> >>>>>     /* trigger a commit and wait for it to complete */
> >>>>>
> >>>>>     flush_work(&sbi->s_sb_upd_work)
> >>>>>
> >>>>>     jbd2_journal_destroy()
> >>>>>       journal->j_flags |= JBD2_UNMOUNT;
> >>>>>
> >>>>>       jbd2_journal_start()
> >>>>>         start_this_handle()
> >>>>>           BUG_ON(JBD2_UNMOUNT)
> >>>>>
> >>>>> Still giving this codepath some thought, but it seems like this might
> >>>>> just be enough to fix the race. Thoughts on this?
> >>
> >> I think this solution should work; the forced commit and flush_work()
> >> should ensure that the last transaction is committed and that the
> >> potential work is done.
> >>
> >> Besides, the s_journal_destroying flag is set and checked concurrently
> >> now, so we need WRITE_ONCE() and READ_ONCE() for it. Also, what
> >> about adding a new flag into sbi->s_mount_state instead of adding a
> >> new s_journal_destroying?
> >
> > Right, that makes sense. I will incorporate these changes in the next
> > revision.
>
> Thinking about this again, it seems that we no longer need the destroying
> flag. Because we force a commit and wait for the **last** transaction to
> complete, and the flush work should also ensure that the last sb_update
> work completes. Regardless of whether it starts a new handle in the
> last update_super_work(), it will not commit since the journal should
> have aborted. What are your thoughts?
>
> ext4_put_super()
>   flush_work(&sbi->s_sb_upd_work)
>   destroy_workqueue(sbi->rsv_conversion_wq)
>
>   ext4_journal_destroy()
>     /* trigger a commit (it will commit the last transaction) */
>
> **kjournald2**
> jbd2_journal_commit_transaction()
> ...
> ext4_inode_error()
>   schedule_work(s_sb_upd_work)
>
> **workqueue**
> update_super_work()
>   jbd2_journal_start(journal)
>     start_this_handle()
>     // This new trans will
>     // not be committed.
>
> jbd2_journal_abort()
>
>     /* wait for it to complete */
>
>     flush_work(&sbi->s_sb_upd_work)
>
>     jbd2_journal_destroy()
>       journal->j_flags |= JBD2_UNMOUNT;
>       jbd2_journal_commit_transaction() // it will commit nothing
>
> Thanks,
> Yi.

Hi Yi,

There's one more path for which we need the flag:

ext4_journal_destroy()
  /* trigger a commit (it will commit the last transaction) */

**kjournald2**
jbd2_journal_commit_transaction()
  journal->j_commit_callback()
    ext4_journal_commit_callback()
      ext4_maybe_update_superblock()
        schedule_work()

/* start a transaction here */

  flush_work()

  jbd2_journal_destroy()
    journal_kill_thread
      flags |= JBD2_UNMOUNT

jbd2_journal_commit_transaction()
...
ext4_inode_error()
  schedule_work(s_sb_upd_work)

/* update_super_work tries to start the txn */
BUG_ON()

I think that to protect against this path, we do need a flag.

Regards,
ojaswin