On 2025/3/13 1:15, Jan Kara wrote: > On Wed 12-03-25 19:56:36, Ojaswin Mujoo wrote: >> On Wed, Mar 12, 2025 at 11:51:03AM +0100, Jan Kara wrote: >>> On Mon 10-03-25 10:13:36, Ritesh Harjani wrote: >>>> Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> writes: >>>>> On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote: >>>>>> Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> writes: >>>>>>> On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote: >>>>>>>> Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> writes: >>>>>>>>> On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote: >>>>>>>>>> Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> writes: >>>>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked >>>>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running >>>>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is >>>>>>>>>>> possible for update_super_work to start a transaction after the journal is >>>>>>>>>>> destroyed eg: >>>>>>>>>>> >>>>>>>>>>> (umount) >>>>>>>>>>> ext4_kill_sb >>>>>>>>>>> kill_block_super >>>>>>>>>>> generic_shutdown_super >>>>>>>>>>> sync_filesystem /* commits all txns */ >>>>>>>>>>> evict_inodes >>>>>>>>>>> /* might start a new txn */ >>>>>>>>>>> ext4_put_super >>>>>>>>>>> flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */ >>>>>>>>>>> jbd2_journal_destroy >>>>>>>>>>> journal_kill_thread >>>>>>>>>>> journal->j_flags |= JBD2_UNMOUNT; >>>>>>>>>>> jbd2_journal_commit_transaction >>>>>>>>>>> jbd2_journal_get_descriptor_buffer >>>>>>>>>>> jbd2_journal_bmap >>>>>>>>>>> ext4_journal_bmap >>>>>>>>>>> ext4_map_blocks >>>>>>>>>>> ... >>>>>>>>>>> ext4_inode_error >>>>>>>>>>> ext4_handle_error >>>>>>>>>>> schedule_work(&sbi->s_sb_upd_work) >>>>>>>>>>> >>>>>>>>>>> /* work queue kicks in */ >>>>>>>>>>> update_super_work >>>>>>>>>>> jbd2_journal_start >>>>>>>>>>> start_this_handle >>>>>>>>>>> BUG_ON(journal->j_flags & >>>>>>>>>>> JBD2_UNMOUNT) >>>>>>>>>>> >>>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is >>>>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not >>>>>>>>>>> set. Otherwise, just fallback to an un-journaled commit. >>>>>>>>>>> >>>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done >>>>>>>>>>> during ext4_put_super() (except a running transaction that will get commited >>>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb >>>>>>>>>>> outside the journal as it won't race with a journaled update (refer >>>>>>>>>>> 2d01ddc86606). >>>>>>>>>>> >>>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only >>>>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here. >>>>>>>>>>> >>>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") >>>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx> >>>>>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx> >>>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> >>>>>>>>>>> --- >>>>>>>>>>> fs/ext4/ext4.h | 2 ++ >>>>>>>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++ >>>>>>>>>>> fs/ext4/super.c | 4 +++- >>>>>>>>>>> 3 files changed, 13 insertions(+), 1 deletion(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h >>>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644 >>>>>>>>>>> --- a/fs/ext4/ext4.h >>>>>>>>>>> +++ b/fs/ext4/ext4.h >>>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info { >>>>>>>>>>> */ >>>>>>>>>>> struct work_struct s_sb_upd_work; >>>>>>>>>>> >>>>>>>>>>> + bool s_journal_destorying; >>>>>>>>>>> + >>>>>>>>>>> /* Atomic write unit values in bytes */ >>>>>>>>>>> unsigned int s_awu_min; >>>>>>>>>>> unsigned int s_awu_max; >>>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h >>>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644 >>>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h >>>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h >>>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour >>>>>>>>>>> { >>>>>>>>>>> int err = 0; >>>>>>>>>>> >>>>>>>>>>> + /* >>>>>>>>>>> + * At this point all pending FS updates should be done except a possible >>>>>>>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It >>>>>>>>>>> + * is now safe for any new errors to directly commit superblock rather >>>>>>>>>>> + * than going via journal. >>>>>>>>>>> + */ >>>>>>>>>>> + sbi->s_journal_destorying = true; >>>>>>>>>> >>>>>>>>>> This is not correct right. I think what we decided to set this flag >>>>>>>>>> before we flush the workqueue. So that we don't schedule any new >>>>>>>>>> work after this flag has been set. At least that is what I understood. >>>>>>>>>> >>>>>>>>>> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@xxxxxxxxx/ >>>>>>>>>> >>>>>>>>>> -ritesh >>>>>>>>> >>>>>>>>> Hey Ritesh, >>>>>>>>> >>>>>>>>> Yes that is not correct, I missed that in my patch however we realised >>>>>>>>> that adding it before flush_work() also has issues [1]. More >>>>>>>>> specifically: >>>>>>>> >>>>>>>> Ohk. right. >>>>>>>> >>>>>>>>> >>>>>>>>> **kjournald2** >>>>>>>>> jbd2_journal_commit_transaction() >>>>>>>>> ... >>>>>>>>> ext4_handle_error() >>>>>>>>> /* s_journal_destorying is not set */ >>>>>>>>> if (journal && !s_journal_destorying) >>>>>>>> >>>>>>>> Then maybe we should not schedule another work to update the superblock >>>>>>>> via journalling, it the error itself occurred while were trying to >>>>>>>> commit the journal txn? >>>>>>>> >>>>>>>> >>>>>>>> -ritesh >>>>>>> >>>>>>> Hmm, ideally yes that should not happen, but how can we achieve that? >>>>>>> For example with the trace we saw: >>>>>>> >>>>>>> **kjournald2** >>>>>>> jbd2_journal_commit_transaction() >>>>>>> jbd2_journal_get_descriptor_buffer >>>>>>> jbd2_journal_bmap >>>>>>> ext4_journal_bmap >>>>>>> ext4_map_blocks >>>>>>> ... >>>>>>> ext4_inode_error >>>>>>> ext4_handle_error >>>>>>> schedule_work(&sbi->s_sb_upd_work) >>>>>>> >>>>>>> How do we tell ext4_handle_error that it is in the context of a >>>>>>> committing txn. >>> >>> So I was thinking about this. It is not a problem to determine we are >>> running in kjournald context - it is enough to check >>> >>> current == EXT4_SB(sb)->s_journal->j_task >> >> Oh, right :) >> >>> >>> But I'm not sure checking this in ext4_handle_error() and doing direct sb >>> update instead of scheduling a journalled one is always correct. For >>> example kjournald does also writeback of ordered data and if that hits an >>> error, we do not necessarily abort the journal (well, currently we do as >>> far as I'm checking but it seems a bit fragile to rely on this). >> >> Okay so IIUC your concern is there might be some codepaths, now or in >> the future, where kjournald might call the FS layer, hit an error and >> still decide to not abort. In which case we would still want to update >> the sb via journal. > > Yeah. The reason why I'm a bit concerned about it is mostly the case of > kjournald also handling ordered data and situations like > !(journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR) where people want to > continue although ordered data had issues. Or situations where something in > j_commit_callback or another jbd2 hook ends up calling ext4_error()... > Ha, right! This is a case where kjournald triggers an ext4 error but does not abort the journal for now, I forgot this one, and there may be more. Thanks for pointing it out. I would also prefer to use this solution of adding ext4_journal_destory(). Thanks, Yi.