On 2021/7/23 20:11, Theodore Ts'o wrote:
On Tue, Jul 20, 2021 at 02:24:09PM +0800, yangerkun wrote:
Commit c92dc856848f ("ext4: defer saving error info from atomic
context") and commit 2d01ddc86606 ("ext4: save error info to sb through
journal if available") added s_error_work to fix a checksum error problem.
But the error path in ext4_fill_super can lead to the following BUG_ON.
Can you share with me your test case? Your patch will result in the
shrinker potentially not getting released in some error paths (which
will cause other kernel panics), and in any case, once the journal is
destroyed here:
@@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
 	sbi->s_ea_block_cache = NULL;
+failed_mount3a:
+	ext4_es_unregister_shrinker(sbi);
+failed_mount3:
+	flush_work(&sbi->s_error_work);
 	if (sbi->s_journal) {
 		jbd2_journal_destroy(sbi->s_journal);
 		sbi->s_journal = NULL;
 	}
-failed_mount3a:
-	ext4_es_unregister_shrinker(sbi);
-failed_mount3:
-	flush_work(&sbi->s_error_work);
sbi->s_journal is set to NULL, which means that in
flush_stashed_error_work(), journal will be NULL, which means we won't
call start_this_handle and so this change will not make a difference
given the kernel stack trace in the commit description.
For example, before we goto failed_mount_wq, we may hit some error and
end up in ext4_handle_error(), which can call
schedule_work(&EXT4_SB(sb)->s_error_work). So the work may run
concurrently with ext4_fill_super() reaching failed_mount_wq, and there
is no lock protecting the concurrent read and modification of
sbi->s_journal.
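
To make the window concrete, here is a trimmed sketch of
flush_stashed_error_work() as I read it after those two commits
(simplified, most of the body elided); the unlocked read of
sbi->s_journal is what races with the error path:

/* Trimmed sketch, my reading of flush_stashed_error_work(); details elided. */
static void flush_stashed_error_work(struct work_struct *work)
{
	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
						s_error_work);
	journal_t *journal = sbi->s_journal;	/* unlocked read of s_journal */
	handle_t *handle;

	if (!sb_rdonly(sbi->s_sb) && journal) {
		/*
		 * If the error path has already run jbd2_journal_destroy()
		 * by the time we get here, this starts a handle
		 * (start_this_handle) on a destroyed journal and hits the
		 * BUG_ON from the report.
		 */
		handle = jbd2_journal_start(journal, 1);
		...
	}
	...
}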
We can inject faults while we do the mount and add some delay between
jbd2_journal_destroy() and the assignment sbi->s_journal = NULL, like
below. With that, we get panic reports easily...
failed_mount_wq:
	ext4_xattr_destroy_cache(sbi->s_ea_inode_cache);
	sbi->s_ea_inode_cache = NULL;
	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
	sbi->s_ea_block_cache = NULL;
	if (sbi->s_journal) {
		jbd2_journal_destroy(sbi->s_journal);
		msleep(300);	/* <---- add some delay */
		sbi->s_journal = NULL;
	}
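
With that delay in place, the interleaving we hit looks roughly like
this (hand-drawn from my understanding, not an actual trace):

  ext4_fill_super (error path)            flush_stashed_error_work
  ----------------------------            ------------------------
  earlier failure ->
    ext4_handle_error() ->
      schedule_work(&sbi->s_error_work)
  goto failed_mount_wq
  jbd2_journal_destroy(sbi->s_journal)
  msleep(300)
                                          journal = sbi->s_journal (not NULL yet)
                                          jbd2_journal_start(journal, 1)
                                            -> start_this_handle() on the
                                               destroyed journal -> BUG_ON
  sbi->s_journal = NULL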
So we need to make sure no such work is still running while we destroy
the journal. It may be better to fix it by moving the flush_work()
before destroying the journal.
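
That is, after the change the error-path ordering becomes roughly
(sketch of the intent only; the diff above is the actual change):

failed_mount3a:
	ext4_es_unregister_shrinker(sbi);
failed_mount3:
	/*
	 * Wait for any queued or running s_error_work before the journal
	 * it may dereference is destroyed.
	 */
	flush_work(&sbi->s_error_work);
	if (sbi->s_journal) {
		jbd2_journal_destroy(sbi->s_journal);
		sbi->s_journal = NULL;
	}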
Thanks,
- Ted