On Fri 08-09-23 20:02:36, Zhihao Cheng wrote: > 在 2023/9/8 19:14, Jan Kara 写道: > Hi Jan, > > On Fri 08-09-23 17:28:08, Zhihao Cheng wrote: > > > JBD2 makes sure journal data is fallen on fs device by sync_blockdev(), > > > however, other process could intercept the EIO information from bdev's > > > mapping, which leads journal recovering successful even EIO occurs during > > > data written back to fs device. > > > > > > We found this problem in our product, iscsi + multipath is chosen for block > > > device of ext4. Unstable network may trigger kpartx to rescan partitions in > > > device mapper layer. Detailed process is shown as following: > > > > > > mount kpartx irq > > > jbd2_journal_recover > > > do_one_pass > > > memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal > > > mark_buffer_dirty // mark bh dirty > > > vfs_read > > > generic_file_read_iter // dio > > > filemap_write_and_wait_range > > > __filemap_fdatawrite_range > > > do_writepages > > > block_write_full_folio > > > submit_bh_wbc > > > >> EIO occurs in disk << > > > end_buffer_async_write > > > mark_buffer_write_io_error > > > mapping_set_error > > > set_bit(AS_EIO, &mapping->flags) // set! > > > filemap_check_errors > > > test_and_clear_bit(AS_EIO, &mapping->flags) // clear! > > > err2 = sync_blockdev > > > filemap_write_and_wait > > > filemap_check_errors > > > test_and_clear_bit(AS_EIO, &mapping->flags) // false > > > err2 = 0 > > > > > > Filesystem is mounted successfully even data from journal is failed written > > > into disk, and ext4/ocfs2 could become corrupted. > > > > > > Fix it by comparing the wb_err state in fs block device before recovering > > > and after recovering. > > > > > > Fetch a reproducer in [Link]. > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888 > > > Cc: stable@xxxxxxxxxxxxxxx > > > Signed-off-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx> > > > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> > > > > Thanks for the patch! It makes sense but it is somewhat inconsistent with > > how we deal with other checks for metadata IO errors in ext4. We do those > > checks in ext4 through ext4_check_bdev_write_error(). So I wonder if in > > this case we shouldn't move the errseq_check_and_advance() in > > __ext4_fill_super() earlier (before journal setup) and then use it in > > ext4_load_and_init_journal() to detect errors during background metadata > > writeback. What do you think? > > > > Do you mean that modify like this? > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index 38217422f938..3f6239f8cc4e 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -4907,6 +4907,13 @@ static int ext4_load_and_init_journal(struct > super_block *sb, > if (err) > return err; > > + err = > errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err, > + &sbi->s_bdev_wb_err); > + if (err) { > + ext4_msg(sb, KERN_ERR, "Failed to sync fs block device"); Maybe make the message (..., "Background error %d when loading journal", err) > + goto out; > + } > + > if (ext4_has_feature_64bit(sb) && > !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0, > JBD2_FEATURE_INCOMPAT_64BIT)) { > @@ -5365,6 +5372,13 @@ static int __ext4_fill_super(struct fs_context *fc, > struct super_block *sb) > goto failed_mount3a; > } > > + /* > + * Save the original bdev mapping's wb_err value which could be > + * used to detect the metadata async write error. > + */ > + spin_lock_init(&sbi->s_bdev_wb_lock); > + errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err, > + &sbi->s_bdev_wb_err); > err = -EINVAL; > /* > * The first inode we look at is the journal inode. Don't try > @@ -5571,13 +5585,6 @@ static int __ext4_fill_super(struct fs_context *fc, > struct super_block *sb) > } > #endif /* CONFIG_QUOTA */ > > - /* > - * Save the original bdev mapping's wb_err value which could be > - * used to detect the metadata async write error. > - */ > - spin_lock_init(&sbi->s_bdev_wb_lock); > - errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err, > - &sbi->s_bdev_wb_err); > EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS; > ext4_orphan_cleanup(sb, es); > EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS; Yes, this looks good to me. > If so, there are two points: > 1. ocfs2 also uses jbd2, we only fix ext4. Yes, but ocfs2 also currently doesn't handle any IO errors due to failed background writeback of buffers needing checkpointing which is IMO a much more probable issue than a problem during journal replay. So yes, we can think whether we don't want to move all the background EIO handling into jbd2 but for this particular case I think the inconsistency is worse than the benefit of fixing ocfs2. > 2. EIO from ext4_commit_super() in ext4_load_journal is ignored, now ext4 > will fail mounting. Yes, but why should we continue mounting when we cannot write to the superblock? I see no real benefit... Honza > > > --- > > > fs/jbd2/recovery.c | 7 +++++++ > > > 1 file changed, 7 insertions(+) > > > > > > diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c > > > index c269a7d29a46..0fecaa6a3ac6 100644 > > > --- a/fs/jbd2/recovery.c > > > +++ b/fs/jbd2/recovery.c > > > @@ -289,6 +289,8 @@ int jbd2_journal_recover(journal_t *journal) > > > journal_superblock_t * sb; > > > struct recovery_info info; > > > + errseq_t wb_err; > > > + struct address_space *mapping; > > > memset(&info, 0, sizeof(info)); > > > sb = journal->j_superblock; > > > @@ -306,6 +308,8 @@ int jbd2_journal_recover(journal_t *journal) > > > return 0; > > > } > > > + mapping = journal->j_fs_dev->bd_inode->i_mapping; > > > + errseq_check_and_advance(&mapping->wb_err, &wb_err); > > > err = do_one_pass(journal, &info, PASS_SCAN); > > > if (!err) > > > err = do_one_pass(journal, &info, PASS_REVOKE); > > > @@ -327,6 +331,9 @@ int jbd2_journal_recover(journal_t *journal) > > > jbd2_journal_clear_revoke(journal); > > > err2 = sync_blockdev(journal->j_fs_dev); > > > + if (!err) > > > + err = err2; > > > + err2 = errseq_check_and_advance(&mapping->wb_err, &wb_err); > > > if (!err) > > > err = err2; > > > /* Make sure all replayed data is on permanent storage */ > > > -- > > > 2.39.2 > > > > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR