On Tue, Mar 04, 2025 at 11:25:02AM +0100, Jan Kara wrote: > On Thu 27-02-25 16:50:22, Ojaswin Mujoo wrote: > > On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote: > > > On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote: > > > > Presently we always BUG_ON if trying to start a transaction on a journal > > > > marked with JBD2_UNMOUNT, since this should never happen. However while > > > > running stress tests it was observed that in case of some error handling > > > > paths, it is possible for update_super_work to start a transaction after > > > > the journal is destroyed eg: > > > > > > > > (umount) > > > > ext4_kill_sb > > > > kill_block_super > > > > generic_shutdown_super > > > > sync_filesystem /* commits all txns */ > > > > evict_inodes > > > > /* might start a new txn */ > > > > ext4_put_super > > > > flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */ > > > > jbd2_journal_destroy > > > > journal_kill_thread > > > > journal->j_flags |= JBD2_UNMOUNT; > > > > jbd2_journal_commit_transaction > > > > jbd2_journal_get_descriptor_buffer > > > > jbd2_journal_bmap > > > > ext4_journal_bmap > > > > ext4_map_blocks > > > > ... > > > > ext4_inode_error > > > > ext4_handle_error > > > > schedule_work(&sbi->s_sb_upd_work) > > > > > > > > /* work queue kicks in */ > > > > update_super_work > > > > jbd2_journal_start > > > > start_this_handle > > > > BUG_ON(journal->j_flags & > > > > JBD2_UNMOUNT) > > > > > > > > Hence, make sure we only defer the update of ext4 sb if the sb is still > > > > active. Otherwise, just fallback to an un-journaled commit. > > > > > > > > The important thing to note here is that we must only defer sb update if > > > > we have not yet flushed the s_sb_update_work queue in umount path else > > > > this race can be hit (point 1 below). Since we don't have a direct way > > > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit > > > > subtle so adding some notes below for future reference: > > > > > > > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT > > > > == 0) however this is not correct since we could end up scheduling work > > > > after it has been flushed: > > > > > > > > ext4_put_super > > > > flush_work(&sbi->s_sb_upd_work) > > > > > > > > **kjournald2** > > > > jbd2_journal_commit_transaction > > > > ... > > > > ext4_inode_error > > > > /* JBD2_UNMOUNT not set */ > > > > schedule_work(s_sb_upd_work) > > > > > > > > jbd2_journal_destroy > > > > journal->j_flags |= JBD2_UNMOUNT; > > > > > > > > **workqueue** > > > > update_super_work > > > > jbd2_journal_start > > > > start_this_handle > > > > BUG_ON(JBD2_UNMOUNT) > > > > > > > > Something like the above doesn't happen with SB_ACTIVE check because we > > > > are sure that the workqueue would be flushed at a later point if we are > > > > in the umount path. > > > > > > > > 2. We don't need a similar check in ext4_grp_locked_error since it is > > > > only called from mballoc and AFAICT it would be always valid to schedule > > > > work here. > > > > > > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") > > > > Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx> > > > > Suggested-by: Ritesh Harjani <ritesh.list@xxxxxxxxx> > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> > > > > > > Good catch! But I think the solution will have to be slightly different. > > > Basing the check on SB_ACTIVE has the problem that you can have racing > > > updates of the sb in the still running transaction and in your direct > > > update leading to inconsistencies after a crash (that was the reason why > > > we've created the s_sb_upd_work in the first place). > > > > > > I would solve this by implementing something like > > > ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the > > > workqueue, and then destroy the journal. And ext4_handle_error() will check > > > for the sbi flag. > > > > > > Honza > > > > Hey Jan, > > > > Thanks for the review. So earlier I did go through different code paths to see > > if we will have a direct sb write clash with a journalled one it wouldn't but, > > relooking at it, seems like we might have a scenario as follows: > > > > generic_super_shutdown > > sync_filesytems > > /* running txns committed. executing ext4_journal_commit_callback */ > > ext4_maybe_update_superblock > > /* schedules work */ > > schedule_work(&sbi->s_sb_upd_work) > > update_super_work > > /* start a txn and add sb to it */ > > sb->s_flags &= ~SB_ACTIVE; > > evict_inode > > ext4_evict_inode > > ext4_std_error > > ext4_handle_error > > /* direct commit of sb (Not good!) */ > > > > > > Now with the 'setting the flag in sbi' approach, I'm not sure if that will be > > enough to handle this as well. For example, if we add a flag like > > sbi->s_journal_destroying, then: > > > > ext4_put_super > > sbi->s_journal_destroying = true > > flush_workqueue() > > /* sb is now journalled */ > > jbd2_journal_destory > > jbd2_journal_commit_transaction > > /* add tag for sb in descriptor and add buffer to wbufs[] */ > > /* Later from some other buffer in the txn: */ > > jbd2_journal_next_log_block > > /* hits error in ext4_journal_bmap */ > > ext4_handle_error > > sbi->s_journal_destroying == true > > /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */ > > jbd2_journal_abort > > /* after abort everything in wbufs[] is written to journal */ > > > > In the above we will have a checksum mismatch but then maybe its not really > > an issue. Maybe since we never commit the txn it is understood that the contents > > can't be trusted and it should be fine to have a mismatch b/w the decriptor tag > > and the actual super block contents? In which case the sbi flag approach should > > be fine. > > > > Does my understanding sound correct? > > Yes. Since the transaction does not get committed, its contents will be > (and must be) ignored. So although you are correct that the superblock > content in the transaction need to match the content we write directly, it > does not matter because whatever is in the uncommitted transaction must > never be written to the final position on disk. > > Honza Got it, thanks for the review. I'll address this in v2 Regards, ojaswin > > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR