Re: [PATCH] jbd2: skip reading super block if it has been verified

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Thu, 15 Jun 2023 16:22:50 +0800

On 2023/6/15 13:26, Theodore Ts'o wrote:
> On Thu, Jun 15, 2023 at 11:49:41AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> We got a NULL pointer dereference issue below while running generic/475
>> I/O failure pressure test.
> 
> Have you been able to reproduce this failure without the "recheck
> checkpoint" series applied?  I have not, so like with the e2fsck bug
> fix, I can understand how the bug fix worked, but I still don't
> understand why I wasn't seeing it until I tried to apply the "recheck
> checkpoint" and the following patches in that patch series.

Yes, I can reproduce this failure without the "recheck
checkpoint" series applied. It takes anywhere from about 5 minutes to
1 hour on your dev branch (reset to the parent commit
5404e4738054 "ext4: refactoring to use the unified helper
ext4_quotas_off()") with the fstests config below.

# ext4 regression fstests config
[ext4]
export FSTYP=ext4
export TEST_DEV=/dev/pmem0p1
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/pmem0p2
export SCRATCH_MNT=/mnt/scratch
export LOGWRITES_DEV=/dev/vdc1
export SCRATCH_LOGDEV=/dev/vdc2
export MKFS_OPTIONS="-O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_file,^dir_nlink,^extra_isize"

[  315.435845] EXT4-fs (dm-0): previous I/O error to superblock detected
[  315.435877] EXT4-fs (dm-0): I/O error while writing superblock
[  315.435885] EXT4-fs (dm-0): Remounting filesystem read-only
[  315.438261] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  315.453689] #PF: supervisor write access in kernel mode
[  315.454884] #PF: error_code(0x0002) - not-present page
[  315.456048] PGD 139b3b067 P4D 139b3b067 PUD 1538ea067 PMD 0
[  315.456201] EXT4-fs error (device dm-0): __ext4_find_entry:1678: inode #131073: comm fsstress: reading directory lblock 0
[  315.457403] Oops: 0002 [#1] PREEMPT SMP
[  315.457411] CPU: 14 PID: 10107 Comm: fsstress Not tainted 6.4.0-rc5-00054-g5404e4738054 #214
[  315.457416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[  315.457418] RIP: 0010:jbd2_journal_set_features+0xf4/0x500
[  315.461326] EXT4-fs (dm-0): I/O error while writing superblock
[  315.462073] Code: 48 83 05 5e 32 90 0c 01 48 83 05 f6 05 90 0c 01 4d 8b 74 24 38 e8 dc 6c bc 00 48 83 05 ec 05 90 0c 01 48 83 05 bc 05 90 0c 01 <f0> 49 0f ba 2e 02 0f 92 c0 48 83 05 b3 05 90 0c 01 48 83 05 d5
[  315.462086] RSP: 0018:ffffc900116cbad8 EFLAGS: 00010212
[  315.462103] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
[  315.462107] RDX: 0000000080000000 RSI: ffffffffafd25d54 RDI: 0000000000000001
[  315.462115] RBP: 0000000000000000 R08: ffffffffafd256f0 R09: 0000000000000000
[  315.468526] R10: 642820726f727265 R11: 2073662d34545845 R12: ffff88817e85e800
[  315.468535] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888126d93000
[  315.468548] FS:  00007fda46982b80(0000) GS:ffff888237980000(0000) knlGS:0000000000000000
[  315.468560] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  315.468568] CR2: 0000000000000000 CR3: 00000001398d0000 CR4: 00000000000006e0
[  315.487792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  315.487798] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  315.491048] Call Trace:
[  315.491560]  <TASK>
[  315.494065]  ? show_regs+0x84/0x90
[  315.494089]  ? __die_body+0x22/0x90
[  315.494104]  ? __die+0x35/0x50
[  315.494121]  ? page_fault_oops+0x1d3/0x5e0
[  315.515211]  ? search_bpf_extables+0x85/0xc0
[  315.523168]  ? jbd2_journal_set_features+0xf4/0x500
[  315.524214]  ? search_exception_tables+0x7c/0x90
[  315.525211]  ? kernelmode_fixup_or_oops+0x140/0x1a0
[  315.526370]  ? __bad_area_nosemaphore+0x208/0x350
[  315.527475]  ? mt_find+0x2ab/0x3c0
[  315.528718]  ? __bad_area+0x88/0xc0
[  315.529936]  ? bad_area+0x1a/0x30
[  315.530696]  ? do_user_addr_fault+0xa6d/0xd00
[  315.531550]  ? exc_page_fault+0xe7/0x3b0
[  315.532339]  ? asm_exc_page_fault+0x22/0x30
[  315.533153]  ? jbd2_journal_set_features+0xf4/0x500
[  315.533922]  ? jbd2_journal_set_features+0xe4/0x500
[  315.534636]  jbd2_journal_revoke+0x43/0x330
[  315.535272]  __ext4_forget+0x112/0x2c0
[  315.535804]  ? __find_get_block+0x155/0x5a0
[  315.536443]  ext4_free_blocks+0xbd2/0xf20
[  315.537058]  ? ext4_free_data+0x140/0x210
[  315.538420]  ? ext4_free_branches+0x2d4/0x3a0
[  315.540534]  ext4_free_branches+0x1c9/0x3a0
[  315.542064]  ext4_ind_truncate+0x361/0x3f0
[  315.543304]  ? ext4_discard_preallocations+0x3c1/0x740
[  315.546111]  ext4_truncate+0x4a0/0x710
[  315.547623]  ext4_file_write_iter+0xb8d/0xe90
[  315.548940]  vfs_write+0x20e/0x590
[  315.549986]  ksys_write+0x77/0x160
[  315.552027]  __x64_sys_write+0x1d/0x30
[  315.553492]  do_syscall_64+0x68/0xf0
[  315.554711]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

I can also speed up the reproduction to about 2 minutes by adding a
delay in jbd2_write_superblock(), whether or not the "recheck
checkpoint" patch series is applied.

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b5e57735ab3f..90d78fe0fb33 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1623,6 +1623,7 @@ static int journal_reset(journal_t *journal)
  * This function expects that the caller will have locked the journal
  * buffer head, and will return with it unlocked
  */
+#include <linux/delay.h>
 static int jbd2_write_superblock(journal_t *journal, blk_opf_t write_flags)
 {
 	struct buffer_head *bh = journal->j_sb_buffer;
@@ -1659,6 +1660,7 @@ static int jbd2_write_superblock(journal_t *journal, blk_opf_t write_flags)
 	bh->b_end_io = end_buffer_write_sync;
 	submit_bh(REQ_OP_WRITE | write_flags, bh);
 	wait_on_buffer(bh);
+	msleep(10);
 	if (buffer_write_io_error(bh)) {
 		clear_buffer_write_io_error(bh);
 		set_buffer_uptodate(bh);

> 
>> If the journal super block has already been read and verified, there is
>> no need to call bh_read() to read it again, even if writing it out has
>> failed. So the fix could simply be to move the buffer_verified(bh) check
>> in front of bh_read().
>>
>> Fixes: d9eafe0afafa ("jbd2: factor out journal initialization from journal_get_superblock()")
> 
> That works, but it's worth noting that commit d9eafe0afafa caused the
> failure by removing the check on j_format_version to determine
> whether the superblock was read or not.  If the journal superblock had
> been previously read, j_format_version would be either 1 or 2.  If it
> had been zero, then the superblock had not been read.  So from commit
> d9eafe0afafa:
> 
>  	/* Load journal superblock if it is not loaded yet. */
> -	if (journal->j_format_version == 0 &&
> -	    journal_get_superblock(journal) != 0)
> +	if (journal_get_superblock(journal))
>  		return 0;
>  	if (!jbd2_format_support_feature(journal))
>  		return 0;
> 
> 
> The comment "Load journal superblock if it is not loaded yet." should
> be removed, since it no longer makes sense once the
> "journal->j_format_version == 0" check was removed.

Yes.

> 
> I'll also note that a problem with d9eafe0afafa is that by removing
> the j_format_version check, every time we add a revoke header, and we
> call jbd2_journal_set_features(), this was causing an unconditional
> read of the journal superblock and that unnecessary I/O could slow
> down certain workloads.
> 

Yes. Fortunately it is innocuous in general, because the journal super
block buffer is always in memory and uptodate, so bh_read() does not
submit any I/O. It only affects the fault case: the window in
jbd2_write_superblock() where writing the journal super block has
failed and the buffer has not yet been restored to uptodate.
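
To make that window concrete, here is a rough userspace model in C. It
is not kernel code: struct model_bh, the model_* helpers and simulate()
are made-up names for illustration, and they only imitate how I
understand bh_read(), buffer_verified() and the failed-write path to
interact. The real superblock verification is reduced to setting a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for the buffer_head state involved (hypothetical). */
struct model_bh {
	bool uptodate;	/* models buffer_uptodate() */
	bool verified;	/* models buffer_verified() */
	bool io_fails;	/* injected device failure */
	int reads;	/* number of read I/Os actually submitted */
};

/* Models bh_read(): 1 if already uptodate (no I/O), 0 on success, <0 on error. */
static int model_bh_read(struct model_bh *bh)
{
	if (bh->uptodate)
		return 1;
	bh->reads++;
	if (bh->io_fails)
		return -EIO;
	bh->uptodate = true;
	return 0;
}

/*
 * Models the window in jbd2_write_superblock(): the write has failed and
 * uptodate has been cleared, but set_buffer_uptodate() has not run yet.
 */
static void model_enter_failed_write_window(struct model_bh *bh)
{
	bh->uptodate = false;
	bh->io_fails = true;
}

/* Models journal_get_superblock(); verified_first selects the patched order. */
static int model_get_superblock(struct model_bh *bh, bool verified_first)
{
	int err;

	if (verified_first && bh->verified)
		return 0;		/* patched: skip the read entirely */
	err = model_bh_read(bh);
	if (err < 0)
		return err;		/* unpatched: fails inside the window */
	if (bh->verified)
		return 0;
	bh->verified = true;		/* stands in for the real checks */
	return 0;
}

/* Load once at "mount", then call again from inside the failure window. */
int simulate(bool verified_first, int *reads)
{
	struct model_bh bh = { 0 };
	int err;

	err = model_get_superblock(&bh, verified_first);
	assert(err == 0 && bh.verified && bh.reads == 1);

	model_enter_failed_write_window(&bh);
	err = model_get_superblock(&bh, verified_first);
	*reads = bh.reads;
	return err;
}
```

In this model, simulate(false, &reads) returns -EIO after submitting a
second read, while simulate(true, &reads) returns 0 and never touches
the device again after the initial load.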

Thanks,
Yi.