On 2023/6/15 4:37, Theodore Ts'o wrote:
> On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>>
>> Sorry about the regression, I found that this issue is not introduced
>> by the first patch in this patch series ("jbd2: recheck chechpointing
>> non-dirty buffer"), it is d9eafe0afafa ("jbd2: factor out journal
>> initialization from journal_get_superblock()") [1] on your dev branch.
>>
>> The problem is that the journal super block had failed to be written
>> out due to an IO fault, its uptodate bit was cleared by
>> end_buffer_write_sync() and had not been reset yet in
>> jbd2_write_superblock(). And it raced with
>> jbd2_journal_revoke()->jbd2_journal_set_features()->
>> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
>> unfortunately, the read IO also failed, so the error handling in
>> journal_fail_superblock() cleared journal->j_sb_buffer, finally leading
>> to the above NULL pointer dereference issue.
>
> Thanks for looking into this.  What I believe you are saying is that
> the root cause is that earlier patch, but it is still something about
> the patch "jbd2: recheck chechpointing non-dirty buffer" which is
> changing the timing enough that we're hitting this buffer (because
> without the "recheck checkpointing" patch, I'm not seeing the NULL
> pointer dereference).

I have sent out a separate patch named "jbd2: skip reading super block
if it has been verified" to fix the above NULL pointer dereference
issue. I have been running ext3 generic/475 for about 12 hours and have
not reproduced the problem again (I will also do more tests later).
Please take a look at it.

>
> As far as the e2fsck bug that was causing it to hang in the ext4/adv
> test scenario, the patch was a simple one, and I have also checked in
> a test case which was a reliable reproducer of the problem.  (See
> attached for the patches and more detail.)
>
> It is really interesting that "recheck checkpointing" patch is making
> enough of a difference that it is unmasking these bugs.  If you could
> take a look at these changes and perhaps think about how this patch
> series could be changing the nature of the corruption (specifically,
> how symlink inodes referenced from inline directories could be
> corrupted with "rechecking checkpointing", thus unmasking the
> e2fsprogs bug), I'd really appreciate it.
>

Sure, we will take a look at it in detail.

Thanks,
Yi.
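
For anyone following the thread, the race described in the quoted text
can be modelled in user space. The sketch below is only a toy model and
not the kernel code or the actual patch: every toy_* name, the
sb_verified flag and the skip_if_verified switch are hypothetical
stand-ins chosen for illustration. It shows why skipping the re-read of
an already verified super block would keep the super block buffer
pointer intact when a failed write is followed by a failed read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins only; these are not real kernel types. */
struct toy_bh {
	bool uptodate;			/* models the buffer's uptodate bit */
};

struct toy_journal {
	struct toy_bh *sb_buffer;	/* models journal->j_sb_buffer */
	bool sb_verified;		/* hypothetical "already verified" flag */
};

/* Models a failed super block write: completion clears uptodate. */
static void toy_write_sb_fail(struct toy_journal *j)
{
	j->sb_buffer->uptodate = false;
}

/* Models the read-failure path, which drops the super block buffer. */
static void toy_fail_superblock(struct toy_journal *j)
{
	j->sb_buffer = NULL;
}

/*
 * Models loading the super block. read_ok says whether a re-read would
 * succeed; skip_if_verified selects the "skip if verified" behaviour.
 * Returns 0 on success, -1 on failure.
 */
static int toy_get_superblock(struct toy_journal *j, bool skip_if_verified,
			      bool read_ok)
{
	if (skip_if_verified && j->sb_verified)
		return 0;		/* trust the in-memory copy */

	if (!j->sb_buffer->uptodate) {
		if (!read_ok) {
			toy_fail_superblock(j);
			return -1;
		}
		j->sb_buffer->uptodate = true;
	}
	j->sb_verified = true;
	return 0;
}

/* Replays the sequence; returns true if sb_buffer is still set. */
static bool toy_race_keeps_sb(bool skip_if_verified)
{
	struct toy_bh bh = { .uptodate = true };
	struct toy_journal j = { .sb_buffer = &bh, .sb_verified = false };

	/* Initial load at mount time succeeds and verifies the block. */
	if (toy_get_superblock(&j, skip_if_verified, true) != 0)
		return false;

	/* The super block write fails, clearing the uptodate bit. */
	toy_write_sb_fail(&j);

	/* A later feature check re-enters the loader; the read fails too. */
	(void)toy_get_superblock(&j, skip_if_verified, false);

	return j.sb_buffer != NULL;
}
```

In this model, toy_race_keeps_sb(false) loses the buffer pointer, which
corresponds to the NULL pointer dereference seen later, while
toy_race_keeps_sb(true) keeps it because the second call returns before
touching the uptodate bit.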