Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE

Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> · Tue, 4 Mar 2025 18:24:27 +0530

On Tue, Mar 04, 2025 at 11:25:02AM +0100, Jan Kara wrote:
> On Thu 27-02-25 16:50:22, Ojaswin Mujoo wrote:
> > On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote:
> > > On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
> > > > Presently we always BUG_ON if trying to start a transaction on a journal
> > > > marked with JBD2_UNMOUNT, since this should never happen. However while
> > > > running stress tests it was observed that in case of some error handling
> > > > paths, it is possible for update_super_work to start a transaction after
> > > > the journal is destroyed eg:
> > > > 
> > > > (umount)
> > > > ext4_kill_sb
> > > >   kill_block_super
> > > >     generic_shutdown_super
> > > >       sync_filesystem /* commits all txns */
> > > >       evict_inodes
> > > >         /* might start a new txn */
> > > >       ext4_put_super
> > > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > >         jbd2_journal_destroy
> > > >           journal_kill_thread
> > > >             journal->j_flags |= JBD2_UNMOUNT;
> > > >           jbd2_journal_commit_transaction
> > > >             jbd2_journal_get_descriptor_buffer
> > > >               jbd2_journal_bmap
> > > >                 ext4_journal_bmap
> > > >                   ext4_map_blocks
> > > >                     ...
> > > >                     ext4_inode_error
> > > >                       ext4_handle_error
> > > >                         schedule_work(&sbi->s_sb_upd_work)
> > > > 
> > > >                                                /* work queue kicks in */
> > > >                                                update_super_work
> > > >                                                  jbd2_journal_start
> > > >                                                    start_this_handle
> > > >                                                      BUG_ON(journal->j_flags &
> > > >                                                             JBD2_UNMOUNT)
> > > > 
> > > > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > > > active.  Otherwise, just fallback to an un-journaled commit.
> > > > 
> > > > The important thing to note here is that we must only defer sb update if
> > > > we have not yet flushed the s_sb_update_work queue in umount path else
> > > > this race can be hit (point 1 below). Since we don't have a direct way
> > > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > > > subtle so adding some notes below for future reference:
> > > > 
> > > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > > > == 0) however this is not correct since we could end up scheduling work
> > > > after it has been flushed:
> > > > 
> > > >  ext4_put_super
> > > >   flush_work(&sbi->s_sb_upd_work)
> > > > 
> > > >                            **kjournald2**
> > > >                            jbd2_journal_commit_transaction
> > > >                            ...
> > > >                            ext4_inode_error
> > > >                              /* JBD2_UNMOUNT not set */
> > > >                              schedule_work(s_sb_upd_work)
> > > > 
> > > >    jbd2_journal_destroy
> > > >     journal->j_flags |= JBD2_UNMOUNT;
> > > > 
> > > >                                       **workqueue**
> > > >                                       update_super_work
> > > >                                        jbd2_journal_start
> > > >                                         start_this_handle
> > > >                                           BUG_ON(JBD2_UNMOUNT)
> > > > 
> > > > Something like the above doesn't happen with SB_ACTIVE check because we
> > > > are sure that the workqueue would be flushed at a later point if we are
> > > > in the umount path.
> > > > 
> > > > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > > > only called from mballoc and AFAICT it would be always valid to schedule
> > > > work here.
> > > > 
> > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
> > > > Suggested-by: Ritesh Harjani <ritesh.list@xxxxxxxxx>
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > > 
> > > Good catch! But I think the solution will have to be slightly different.
> > > Basing the check on SB_ACTIVE has the problem that you can have racing
> > > updates of the sb in the still running transaction and in your direct
> > > update leading to inconsistencies after a crash (that was the reason why
> > > we've created the s_sb_upd_work in the first place).
> > > 
> > > I would solve this by implementing something like
> > > ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
> > > workqueue, and then destroy the journal. And ext4_handle_error() will check
> > > for the sbi flag.
> > > 
> > > 								Honza
> > 
> > Hey Jan,
> > 
> > Thanks for the review. So earlier I did go through different code paths to see
> > if we will have a direct sb write clash with a journalled one it wouldn't but,
> > relooking at it, seems like we might have a scenario as follows:
> > 
> > generic_super_shutdown
> >  sync_filesytems
> >   /* running txns committed. executing ext4_journal_commit_callback */
> >   ext4_maybe_update_superblock
> >    /* schedules work */
> >    schedule_work(&sbi->s_sb_upd_work)
> >                                           update_super_work
> >                                           /* start a txn and add sb to it */
> >  sb->s_flags &= ~SB_ACTIVE;
> >  evict_inode
> >    ext4_evict_inode
> >     ext4_std_error
> >      ext4_handle_error
> >       /* direct commit of sb (Not good!) */
> > 
> > 
> > Now with the 'setting the flag in sbi' approach, I'm not sure if that will be
> > enough to handle this as well. For example, if we add a flag like
> > sbi->s_journal_destroying, then:
> > 
> > ext4_put_super
> >  sbi->s_journal_destroying = true
> >  flush_workqueue()
> >   /* sb is now journalled */
> >  jbd2_journal_destory
> >   jbd2_journal_commit_transaction
> >    /* add tag for sb in descriptor and add buffer to wbufs[] */
> >    /* Later from some other buffer in the txn: */
> >    jbd2_journal_next_log_block
> >     /* hits error in ext4_journal_bmap */
> >     ext4_handle_error
> >       sbi->s_journal_destroying == true
> >       /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */
> >    jbd2_journal_abort
> >    /* after abort everything in wbufs[] is written to journal */
> > 
> > In the above we will have a checksum mismatch but then maybe its not really
> > an issue. Maybe since we never commit the txn it is understood that the contents
> > can't be trusted and it should be fine to have a mismatch b/w the decriptor tag
> > and the actual super block contents? In which case the sbi flag approach should
> > be fine.
> > 
> > Does my understanding sound correct?
> 
> Yes. Since the transaction does not get committed, its contents will be
> (and must be) ignored. So although you are correct that the superblock
> content in the transaction need to match the content we write directly, it
> does not matter because whatever is in the uncommitted transaction must
> never be written to the final position on disk.
> 
> 								Honza

Got it, thanks for the review. I'll address this in v2

Regards,
ojaswin
> 
> -- 
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR