On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote: > On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote: > > On 2025/3/8 1:26, Ojaswin Mujoo wrote: > > > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote: > > >> On 2025/3/7 18:27, Ojaswin Mujoo wrote: > > >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote: > > >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote: > > >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote: > > >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote: > > >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote: > > >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked > > >>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running > > >>>>>>>> stress tests, it was observed that in case of some error handling paths, it is > > >>>>>>>> possible for update_super_work to start a transaction after the journal is > > >>>>>>>> destroyed eg: > > >>>>>>>> > > >>>>>>>> (umount) > > >>>>>>>> ext4_kill_sb > > >>>>>>>> kill_block_super > > >>>>>>>> generic_shutdown_super > > >>>>>>>> sync_filesystem /* commits all txns */ > > >>>>>>>> evict_inodes > > >>>>>>>> /* might start a new txn */ > > >>>>>>>> ext4_put_super > > >>>>>>>> flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */ > > >>>>>>>> jbd2_journal_destroy > > >>>>>>>> journal_kill_thread > > >>>>>>>> journal->j_flags |= JBD2_UNMOUNT; > > >>>>>>>> jbd2_journal_commit_transaction > > >>>>>>>> jbd2_journal_get_descriptor_buffer > > >>>>>>>> jbd2_journal_bmap > > >>>>>>>> ext4_journal_bmap > > >>>>>>>> ext4_map_blocks > > >>>>>>>> ... > > >>>>>>>> ext4_inode_error > > >>>>>>>> ext4_handle_error > > >>>>>>>> schedule_work(&sbi->s_sb_upd_work) > > >>>>>>>> > > >>>>>>>> /* work queue kicks in */ > > >>>>>>>> update_super_work > > >>>>>>>> jbd2_journal_start > > >>>>>>>> start_this_handle > > >>>>>>>> BUG_ON(journal->j_flags & > > >>>>>>>> JBD2_UNMOUNT) > > >>>>>>>> > > >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is > > >>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not > > >>>>>>>> set. Otherwise, just fallback to an un-journaled commit. > > >>>>>>>> > > >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done > > >>>>>>>> during ext4_put_super() (except a running transaction that will get commited > > >>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb > > >>>>>>>> outside the journal as it won't race with a journaled update (refer > > >>>>>>>> 2d01ddc86606). > > >>>>>>>> > > >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only > > >>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here. > > >>>>>>>> > > >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") > > >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx> > > >>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx> > > >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> > > >>>>>>>> --- > > >>>>>>>> fs/ext4/ext4.h | 2 ++ > > >>>>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++ > > >>>>>>>> fs/ext4/super.c | 4 +++- > > >>>>>>>> 3 files changed, 13 insertions(+), 1 deletion(-) > > >>>>>>>> > > >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > > >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644 > > >>>>>>>> --- a/fs/ext4/ext4.h > > >>>>>>>> +++ b/fs/ext4/ext4.h > > >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info { > > >>>>>>>> */ > > >>>>>>>> struct work_struct s_sb_upd_work; > > >>>>>>>> > > >>>>>>>> + bool s_journal_destorying; > > >>>>>>>> + > > >>>>>>>> /* Atomic write unit values in bytes */ > > >>>>>>>> unsigned int s_awu_min; > > >>>>>>>> unsigned int s_awu_max; > > >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h > > >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644 > > >>>>>>>> --- a/fs/ext4/ext4_jbd2.h > > >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h > > >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour > > >>>>>>>> { > > >>>>>>>> int err = 0; > > >>>>>>>> > > >>>>>>>> + /* > > >>>>>>>> + * At this point all pending FS updates should be done except a possible > > >>>>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It > > >>>>>>>> + * is now safe for any new errors to directly commit superblock rather > > >>>>>>>> + * than going via journal. > > >>>>>>>> + */ > > >>>>>>>> + sbi->s_journal_destorying = true; > > >>>>>>>> + > > >>>>>>> > > >>>>>>> Hi, Ojaswin! > > >>>>>>> > > >>>>>>> I'm afraid you still need to flush the superblock update work here, > > >>>>>>> otherwise I guess the race condition you mentioned in v1 could still > > >>>>>>> occur. > > >>>>>>> > > >>>>>>> ext4_put_super() > > >>>>>>> flush_work(&sbi->s_sb_upd_work) > > >>>>>>> > > >>>>>>> **kjournald2** > > >>>>>>> jbd2_journal_commit_transaction() > > >>>>>>> ... > > >>>>>>> ext4_inode_error() > > >>>>>>> /* JBD2_UNMOUNT not set */ > > >>>>>>> schedule_work(s_sb_upd_work) > > >>>>>>> > > >>>>>>> **workqueue** > > >>>>>>> update_super_work > > >>>>>>> /* s_journal_destorying is not set */ > > >>>>>>> if (journal && !s_journal_destorying) > > >>>>>>> > > >>>>>>> ext4_journal_destroy() > > >>>>>>> /* set s_journal_destorying */ > > >>>>>>> sbi->s_journal_destorying = true; > > >>>>>>> jbd2_journal_destroy() > > >>>>>>> journal->j_flags |= JBD2_UNMOUNT; > > >>>>>>> > > >>>>>>> jbd2_journal_start() > > >>>>>>> start_this_handle() > > >>>>>>> BUG_ON(JBD2_UNMOUNT) > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> Yi. > > >>>>>> Hi Yi, > > >>>>>> > > >>>>>> Yes you are right, somehow missed this edge case :( > > >>>>>> > > >>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the > > >>>>>> helper. Just wondering if I should still let it be in > > >>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false > > >>>>>> before schedule_work(s_sb_upd_work), because it makes sense. > > >>>>>> > > >>>>>> Okay let me give it some thought but thanks for pointing this out! > > >>>>>> > > >>>>>> Regards, > > >>>>>> ojaswin > > >>>>> > > >>>>> Okay so thinking about it a bit more, I see you also suggested to flush > > >>>>> the work after marking sbi->s_journal_destroying. But will that solve > > >>>>> it? > > >>>>> > > >>>>> ext4_put_super() > > >>>>> flush_work(&sbi->s_sb_upd_work) > > >>>>> > > >>>>> **kjournald2** > > >>>>> jbd2_journal_commit_transaction() > > >>>>> ... > > >>>>> ext4_inode_error() > > >>>>> /* JBD2_UNMOUNT not set */ > > >>>>> schedule_work(s_sb_upd_work) > > >>>>> > > >>>>> **workqueue** > > >>>>> update_super_work > > >>>>> /* s_journal_destorying is not set */ > > >>>>> if (journal && !s_journal_destorying) > > >>>>> > > >>>>> ext4_journal_destroy() > > >>>>> /* set s_journal_destorying */ > > >>>>> sbi->s_journal_destorying = true; > > >>>>> flush_work(&sbi->s_sb_upd_work) > > >>>>> schedule_work() > > >>>> ^^^^^^^^^^^^^^^ > > >>>> where does this come from? > > >>>> > > >>>> After this flush_work, we can guarantee that the running s_sb_upd_work > > >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will > > >>>> not commit transaction or call schedule_work() again because it has > > >>>> been aborted due to the previous error. Am I missing something? > > >>>> > > >>>> Thanks, > > >>>> Yi. > > >>> > > >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where > > >>> > > >>> if(journal && !is_journal_destroying) > > >>> > > >>> is computed but schedule_work() not called yet, which is possible cause > > >>> the cmp followed by jump is not atomic in nature. If the schedule_work > > >>> is only called after we have done the flush then we end up with this: > > >>> > > >>> if (journal && !s_journal_destorying) > > >>> ext4_journal_destroy() > > >>> /* set s_journal_destorying */ > > >>> sbi->s_journal_destorying = true; > > >>> flush_work(&sbi->s_sb_upd_work) > > >>> schedule_work() > > >>> > > >>> Which is possible IMO, although the window is tiny. > > >> > > >> Yeah, right! > > >> Sorry for misread the location where you add the "!s_journal_destorying" > > >> check, the graph I provided was in update_super_work(), which was wrong. > > > > > > Oh right, I also misread your trace but yes as discussed, even > > > > > > sbi->s_journal_destorying = true; > > > flush_work() > > > jbd2_journal_destroy() > > > > > > doesn't work. > > > > > >> The right one should be: > > >> > > >> ext4_put_super() > > >> flush_work(&sbi->s_sb_upd_work) > > >> > > >> **kjournald2** > > >> jbd2_journal_commit_transaction() > > >> ... > > >> ext4_inode_error() > > >> /* s_journal_destorying is not set */ > > >> if (journal && !s_journal_destorying) > > >> (schedule_work(s_sb_upd_work)) //can be here > > >> > > >> ext4_journal_destroy() > > >> /* set s_journal_destorying */ > > >> sbi->s_journal_destorying = true; > > >> jbd2_journal_destroy() > > >> journal->j_flags |= JBD2_UNMOUNT; > > >> > > >> (schedule_work(s_sb_upd_work)) //also can be here > > >> > > >> **workqueue** > > >> update_super_work() > > >> journal = sbi->s_journal //get journal > > >> kfree(journal) > > >> jbd2_journal_start(journal) //journal UAF > > >> start_this_handle() > > >> BUG_ON(JBD2_UNMOUNT) //bugon here > > >> > > >> > > >> So there are two problems here, the first one is the 'journal' UAF, > > >> the second one is triggering JBD2_UNMOUNT flag BUGON. > > > > > > Indeed, there's a possible UAF here as well. > > > > > >> > > >>>>> > > >>>>> As for the fix, how about we do something like this: > > >>>>> > > >>>>> ext4_put_super() > > >>>>> > > >>>>> flush_work(&sbi->s_sb_upd_work) > > >>>>> destroy_workqueue(sbi->rsv_conversion_wq); > > >>>>> > > >>>>> ext4_journal_destroy() > > >>>>> /* set s_journal_destorying */ > > >>>>> sbi->s_journal_destorying = true; > > >>>>> > > >>>>> /* trigger a commit and wait for it to complete */ > > >>>>> > > >>>>> flush_work(&sbi->s_sb_upd_work) > > >>>>> > > >>>>> jbd2_journal_destroy() > > >>>>> journal->j_flags |= JBD2_UNMOUNT; > > >>>>> > > >>>>> jbd2_journal_start() > > >>>>> start_this_handle() > > >>>>> BUG_ON(JBD2_UNMOUNT) > > >>>>> > > >>>>> Still giving this codepath some thought but seems like this might just > > >>>>> be enough to fix the race. Thoughts on this? > > >>>>> > > >> > > >> I think this solution should work, the forced commit and flush_work() > > >> should ensure that the last transaction is committed and that the > > >> potential work is done. > > >> > > >> Besides, the s_journal_destorying flag is set and check concurrently > > >> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what > > >> about adding a new flag into sbi->s_mount_state instead of adding > > >> new s_journal_destorying? > > > > > > Right, that makes sence. I will incorporate these changes in the next > > > revision. > > > > > > > Think about this again, it seems that we no longer need the destroying > > flag. Because we force to commit and wait for the **last** transaction to > > complete, and the flush work should also ensure that the last sb_update > > work to complete. Regardless of whether it starts a new handle in the > > last update_super_work(), it will not commit since the journal should > > have aborted. What are your thoughts? > > > > ext4_put_super() > > flush_work(&sbi->s_sb_upd_work) > > destroy_workqueue(sbi->rsv_conversion_wq) > > > > ext4_journal_destroy() > > /* trigger a commit (it will commit the last trnasaction) */ > > > > **kjournald2** > > jbd2_journal_commit_transaction() > > ... > > ext4_inode_error() > > schedule_work(s_sb_upd_work)) > > > > **workqueue** > > update_super_work() > > jbd2_journal_start(journal) > > start_this_handle() > > //This new trans will > > //not be committed. > > > > jbd2_journal_abort() > > > > /* wait for it to complete */ > > > > flush_work(&sbi->s_sb_upd_work) > > jbd2_journal_destroy() > > journal->j_flags |= JBD2_UNMOUNT; > > jbd2_journal_commit_transaction() //it will commit nothing > > > > Thanks, > > Yi. > > Hi Yi, > > There's one more path for which we need the flag: > > ext4_journal_destroy() > /* trigger a commit (it will commit the last trnasaction) */ > > **kjournald2** > jbd2_journal_commit_transaction() > journal->j_commit_callback() > ext4_journal_commit_callback() > ext4_maybe_update_superblock() > schedule_work() > /* start a transaction here */ > flush_work() > jbd2_journal_destroy() > journal_kill_thread > flags |= JBD2_UNMOUNT > jbd2_journal_commit_transaction() > ... > ext4_inode_error() > schedule_work(s_sb_upd_work)) > /* update_super_work_tries to start the txn */ > BUG_ON() Oops the formatting is wrong, here's the trace: ext4_journal_destroy() /* trigger a commit (it will commit the last trnasaction) */ **kjournald2** jbd2_journal_commit_transaction() journal->j_commit_callback() ext4_journal_commit_callback() ext4_maybe_update_superblock() schedule_work() /* update_super_work starts a new txn here */ flush_work() jbd2_journal_destroy() journal_kill_thread flags |= JBD2_UNMOUNT jbd2_journal_commit_transaction() ... ext4_inode_error() schedule_work(s_sb_upd_work)) /* update_super_work_tries to start the txn */ BUG_ON() > > I think this to protect against this path we do need a flag. > > Regards, > ojaswin > >