Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying

Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> · Sat, 8 Mar 2025 15:28:46 +0530

On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote:
> On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> > On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> > >> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> > >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> > >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> > >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> > >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> > >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
> > >>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > >>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
> > >>>>>>>> possible for update_super_work to start a transaction after the journal is
> > >>>>>>>> destroyed eg:
> > >>>>>>>>
> > >>>>>>>> (umount)
> > >>>>>>>> ext4_kill_sb
> > >>>>>>>>   kill_block_super
> > >>>>>>>>     generic_shutdown_super
> > >>>>>>>>       sync_filesystem /* commits all txns */
> > >>>>>>>>       evict_inodes
> > >>>>>>>>         /* might start a new txn */
> > >>>>>>>>       ext4_put_super
> > >>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > >>>>>>>>         jbd2_journal_destroy
> > >>>>>>>>           journal_kill_thread
> > >>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>>>>           jbd2_journal_commit_transaction
> > >>>>>>>>             jbd2_journal_get_descriptor_buffer
> > >>>>>>>>               jbd2_journal_bmap
> > >>>>>>>>                 ext4_journal_bmap
> > >>>>>>>>                   ext4_map_blocks
> > >>>>>>>>                     ...
> > >>>>>>>>                     ext4_inode_error
> > >>>>>>>>                       ext4_handle_error
> > >>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
> > >>>>>>>>
> > >>>>>>>>                                                /* work queue kicks in */
> > >>>>>>>>                                                update_super_work
> > >>>>>>>>                                                  jbd2_journal_start
> > >>>>>>>>                                                    start_this_handle
> > >>>>>>>>                                                      BUG_ON(journal->j_flags &
> > >>>>>>>>                                                             JBD2_UNMOUNT)
> > >>>>>>>>
> > >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > >>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
> > >>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
> > >>>>>>>>
> > >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
> > >>>>>>>> during ext4_put_super() (except a running transaction that will get commited
> > >>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > >>>>>>>> outside the journal as it won't race with a journaled update (refer
> > >>>>>>>> 2d01ddc86606).
> > >>>>>>>>
> > >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > >>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
> > >>>>>>>>
> > >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
> > >>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
> > >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > >>>>>>>> ---
> > >>>>>>>>  fs/ext4/ext4.h      | 2 ++
> > >>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
> > >>>>>>>>  fs/ext4/super.c     | 4 +++-
> > >>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
> > >>>>>>>>
> > >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
> > >>>>>>>> --- a/fs/ext4/ext4.h
> > >>>>>>>> +++ b/fs/ext4/ext4.h
> > >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > >>>>>>>>  	 */
> > >>>>>>>>  	struct work_struct s_sb_upd_work;
> > >>>>>>>>  
> > >>>>>>>> +	bool s_journal_destorying;
> > >>>>>>>> +
> > >>>>>>>>  	/* Atomic write unit values in bytes */
> > >>>>>>>>  	unsigned int s_awu_min;
> > >>>>>>>>  	unsigned int s_awu_max;
> > >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
> > >>>>>>>> --- a/fs/ext4/ext4_jbd2.h
> > >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
> > >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > >>>>>>>>  {
> > >>>>>>>>  	int err = 0;
> > >>>>>>>>  
> > >>>>>>>> +	/*
> > >>>>>>>> +	 * At this point all pending FS updates should be done except a possible
> > >>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > >>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
> > >>>>>>>> +	 * than going via journal.
> > >>>>>>>> +	 */
> > >>>>>>>> +	sbi->s_journal_destorying = true;
> > >>>>>>>> +
> > >>>>>>>
> > >>>>>>> Hi, Ojaswin!
> > >>>>>>>
> > >>>>>>> I'm afraid you still need to flush the superblock update work here,
> > >>>>>>> otherwise I guess the race condition you mentioned in v1 could still
> > >>>>>>> occur.
> > >>>>>>>
> > >>>>>>>  ext4_put_super()
> > >>>>>>>   flush_work(&sbi->s_sb_upd_work)
> > >>>>>>>
> > >>>>>>>                     **kjournald2**
> > >>>>>>>                     jbd2_journal_commit_transaction()
> > >>>>>>>                     ...
> > >>>>>>>                     ext4_inode_error()
> > >>>>>>>                       /* JBD2_UNMOUNT not set */
> > >>>>>>>                       schedule_work(s_sb_upd_work)
> > >>>>>>>
> > >>>>>>>                                   **workqueue**
> > >>>>>>>                                    update_super_work
> > >>>>>>>                                    /* s_journal_destorying is not set */
> > >>>>>>>                             	   if (journal && !s_journal_destorying)
> > >>>>>>>
> > >>>>>>>   ext4_journal_destroy()
> > >>>>>>>    /* set s_journal_destorying */
> > >>>>>>>    sbi->s_journal_destorying = true;
> > >>>>>>>    jbd2_journal_destroy()
> > >>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>>>
> > >>>>>>>                                        jbd2_journal_start()
> > >>>>>>>                                         start_this_handle()
> > >>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Yi.
> > >>>>>> Hi Yi,
> > >>>>>>
> > >>>>>> Yes you are right, somehow missed this edge case :(
> > >>>>>>
> > >>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
> > >>>>>> helper. Just wondering if I should still let it be in
> > >>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
> > >>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
> > >>>>>>
> > >>>>>> Okay let me give it some thought but thanks for pointing this out!
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> ojaswin
> > >>>>>
> > >>>>> Okay so thinking about it a bit more, I see you also suggested to flush
> > >>>>> the work after marking sbi->s_journal_destroying. But will that solve
> > >>>>> it?
> > >>>>>
> > >>>>>   ext4_put_super()
> > >>>>>    flush_work(&sbi->s_sb_upd_work)
> > >>>>>  
> > >>>>>                      **kjournald2**
> > >>>>>                      jbd2_journal_commit_transaction()
> > >>>>>                      ...
> > >>>>>                      ext4_inode_error()
> > >>>>>                        /* JBD2_UNMOUNT not set */
> > >>>>>                        schedule_work(s_sb_upd_work)
> > >>>>>  
> > >>>>>                                     **workqueue**
> > >>>>>                                     update_super_work
> > >>>>>                                     /* s_journal_destorying is not set */
> > >>>>>                              	      if (journal && !s_journal_destorying)
> > >>>>>  
> > >>>>>    ext4_journal_destroy()
> > >>>>>     /* set s_journal_destorying */
> > >>>>>     sbi->s_journal_destorying = true;
> > >>>>>     flush_work(&sbi->s_sb_upd_work)
> > >>>>>                                       schedule_work()
> > >>>>                                         ^^^^^^^^^^^^^^^
> > >>>>                                         where does this come from?
> > >>>>
> > >>>> After this flush_work, we can guarantee that the running s_sb_upd_work
> > >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> > >>>> not commit transaction or call schedule_work() again because it has
> > >>>> been aborted due to the previous error. Am I missing something?
> > >>>>
> > >>>> Thanks,
> > >>>> Yi.
> > >>>
> > >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
> > >>>
> > >>>  if(journal && !is_journal_destroying) 
> > >>>
> > >>> is computed but schedule_work() not called yet, which is possible cause
> > >>> the cmp followed by jump is not atomic in nature. If the schedule_work
> > >>> is only called after we have done the flush then we end up with this:
> > >>>
> > >>>                               	      if (journal && !s_journal_destorying)
> > >>>     ext4_journal_destroy()
> > >>>      /* set s_journal_destorying */
> > >>>      sbi->s_journal_destorying = true;
> > >>>      flush_work(&sbi->s_sb_upd_work)
> > >>>                                        schedule_work()
> > >>>
> > >>> Which is possible IMO, although the window is tiny.
> > >>
> > >> Yeah, right!
> > >> Sorry for misread the location where you add the "!s_journal_destorying"
> > >> check, the graph I provided was in update_super_work(), which was wrong.
> > > 
> > > Oh right, I also misread your trace but yes as discussed, even 
> > > 
> > >     sbi->s_journal_destorying = true;
> > > 		flush_work()
> > >     jbd2_journal_destroy()
> > > 
> > > doesn't work.
> > > 
> > >> The right one should be:
> > >>
> > >>  ext4_put_super()
> > >>   flush_work(&sbi->s_sb_upd_work)
> > >>
> > >>                     **kjournald2**
> > >>                     jbd2_journal_commit_transaction()
> > >>                     ...
> > >>                     ext4_inode_error()
> > >>                       /* s_journal_destorying is not set */
> > >>                       if (journal && !s_journal_destorying)
> > >>                         (schedule_work(s_sb_upd_work))  //can be here
> > >>
> > >>   ext4_journal_destroy()
> > >>    /* set s_journal_destorying */
> > >>    sbi->s_journal_destorying = true;
> > >>    jbd2_journal_destroy()
> > >>     journal->j_flags |= JBD2_UNMOUNT;
> > >>
> > >>                         (schedule_work(s_sb_upd_work))  //also can be here
> > >>
> > >>                                   **workqueue**
> > >>                                    update_super_work()
> > >>                                    journal = sbi->s_journal //get journal
> > >>     kfree(journal)
> > >>                                      jbd2_journal_start(journal) //journal UAF
> > >>                                        start_this_handle()
> > >>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
> > >>
> > >>
> > >> So there are two problems here, the first one is the 'journal' UAF,
> > >> the second one is triggering JBD2_UNMOUNT flag BUGON.
> > > 
> > > Indeed, there's a possible UAF here as well.
> > > 
> > >>
> > >>>>>
> > >>>>> As for the fix, how about we do something like this:
> > >>>>>
> > >>>>>   ext4_put_super()
> > >>>>>
> > >>>>>    flush_work(&sbi->s_sb_upd_work)
> > >>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
> > >>>>>
> > >>>>>    ext4_journal_destroy()
> > >>>>>     /* set s_journal_destorying */
> > >>>>>     sbi->s_journal_destorying = true;
> > >>>>>
> > >>>>>    /* trigger a commit and wait for it to complete */
> > >>>>>
> > >>>>>     flush_work(&sbi->s_sb_upd_work)
> > >>>>>
> > >>>>>     jbd2_journal_destroy()
> > >>>>>      journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>  
> > >>>>>                                         jbd2_journal_start()
> > >>>>>                                          start_this_handle()
> > >>>>>                                            BUG_ON(JBD2_UNMOUNT)
> > >>>>>
> > >>>>> Still giving this codepath some thought but seems like this might just
> > >>>>> be enough to fix the race. Thoughts on this?
> > >>>>>
> > >>
> > >> I think this solution should work, the forced commit and flush_work()
> > >> should ensure that the last transaction is committed and that the
> > >> potential work is done.
> > >>
> > >> Besides, the s_journal_destorying flag is set and check concurrently
> > >> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
> > >> about adding a new flag into sbi->s_mount_state instead of adding
> > >> new s_journal_destorying?
> > > 
> > > Right, that makes sence. I will incorporate these changes in the next 
> > > revision.
> > > 
> > 
> > Think about this again, it seems that we no longer need the destroying
> > flag. Because we force to commit and wait for the **last** transaction to
> > complete, and the flush work should also ensure that the last sb_update
> > work to complete. Regardless of whether it starts a new handle in the
> > last update_super_work(), it will not commit since the journal should
> > have aborted. What are your thoughts?
> > 
> >  ext4_put_super()
> >   flush_work(&sbi->s_sb_upd_work)
> >   destroy_workqueue(sbi->rsv_conversion_wq)
> > 
> >   ext4_journal_destroy()
> >    /* trigger a commit (it will commit the last trnasaction) */
> > 
> >                     **kjournald2**
> >                     jbd2_journal_commit_transaction()
> >                     ...
> >                      ext4_inode_error()
> >                       schedule_work(s_sb_upd_work))
> > 
> >                                      **workqueue**
> >                                       update_super_work()
> >                                         jbd2_journal_start(journal)
> >                                           start_this_handle()
> >                                           //This new trans will
> >                                           //not be committed.
> > 
> >                      jbd2_journal_abort()
> > 
> >    /* wait for it to complete */
> > 
> >    flush_work(&sbi->s_sb_upd_work)
> >    jbd2_journal_destroy()
> >     journal->j_flags |= JBD2_UNMOUNT;
> >    jbd2_journal_commit_transaction() //it will commit nothing
> > 
> > Thanks,
> > Yi.
> 
> Hi Yi,
> 
> There's one more path for which we need the flag:
> 
>    ext4_journal_destroy()
>     /* trigger a commit (it will commit the last trnasaction) */
>  
>                      **kjournald2**
>                      jbd2_journal_commit_transaction()
> 										 	journal->j_commit_callback()
> 											  ext4_journal_commit_callback()
> 												  ext4_maybe_update_superblock()
> 													  schedule_work()
> 			/* start a transaction here */
> 			flush_work()
> 			  jbd2_journal_destroy()
> 				  journal_kill_thread
> 					  flags |= JBD2_UNMOUNT
> 				  jbd2_journal_commit_transaction()
> 					  ...
>             ext4_inode_error()
> 							schedule_work(s_sb_upd_work))
> 							/* update_super_work_tries to start the txn */
> 							BUG_ON()

Oops the formatting is wrong, here's the trace:

 ext4_journal_destroy()
    /* trigger a commit (it will commit the last trnasaction) */

                   **kjournald2**
                   jbd2_journal_commit_transaction()
                    journal->j_commit_callback()
                      ext4_journal_commit_callback()
                        ext4_maybe_update_superblock()
                          schedule_work()

    /* update_super_work starts a new txn here */
    flush_work()
    jbd2_journal_destroy()
      journal_kill_thread
        flags |= JBD2_UNMOUNT
      jbd2_journal_commit_transaction()
        ...
        ext4_inode_error()
          schedule_work(s_sb_upd_work))
          /* update_super_work_tries to start the txn */
          BUG_ON()

> 
> I think this to protect against this path we do need a flag. 
> 
> Regards,
> ojaswin
> >