Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_filestream_select_ag in v6.3-rc3

Pengfei Xu <pengfei.xu@xxxxxxxxx> · Wed, 22 Mar 2023 11:20:55 +0800

Hi Darrick J. Wong,

On 2023-03-21 at 13:46:38 -0700, Darrick J. Wong wrote:
> On Mon, Mar 20, 2023 at 02:50:07PM +0800, Pengfei Xu wrote:
> > Hi Dave Chinner and xfs experts,
> > 
> > Greeting!
> > 
> > There is BUG: unable to handle kernel NULL pointer dereference in
> > xfs_filestream_select_ag in v6.3-rc3:
> > 
> > All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/230319_210525_xfs_filestream_select_ag
> > Reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/repro.c
> 
> How the hell am I supposed to extract the fuzzed disk image for
> analysis?
> 
> Current Google syzbot provides a lot more information for analysis.  Why
> don't you go triage some of their reports instead of spraying more crap
> at the XFS list?
> 
Ah, thanks a lot for your suggestion!
Next time I will include more of the syzkaller analysis, as below, in every
problem report.

Updated with more info as follows:
More detailed analysis from syzkaller report0: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/report0
repro.stats: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/repro.stats
vm machine info: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/machineInfo0

I have also newly added repro.report: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/repro.report
"
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (loop0): metadata I/O error in "xfs_read_agf+0xd0/0x2c0" at daddr 0x8001 len 1 error 74
XFS (loop0): page discard on page 00000000b8174cbd, inode 0x46, pos 0.
BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0 
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 34 Comm: kworker/u4:2 Not tainted 6.3.0-rc2-intel-next-38f821ff82e9+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-7:0)
RIP: 0010:arch_atomic_inc arch/x86/include/asm/atomic.h:95 [inline]
RIP: 0010:atomic_inc include/linux/atomic/atomic-instrumented.h:191 [inline]
RIP: 0010:xfs_filestream_create_association fs/xfs/xfs_filestream.c:321 [inline]
RIP: 0010:xfs_filestream_select_ag+0x5d5/0xce0 fs/xfs/xfs_filestream.c:372
Code: 80 ff 49 89 5d 18 be 08 00 00 00 bf 20 00 00 00 e8 80 f9 03 00 48 89 c3 48 85 c0 0f 84 3a 05 00 00 e8 9f 8a 80 ff 49 8b 45 18 <f0> ff 40 10 49 8b 45 18 48 8b 75 b8 48 89 da 48 89 43 18 48 8b 45
RSP: 0018:ffffc900001274c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88800dbeae40 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88800791a340 RDI: 0000000000000002
RBP: ffffc90000127548 R08: ffffc90000127400 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffc90000127588 R14: 0000000000000001 R15: ffffc90000127708
FS:  0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000000b85c002 CR4: 0000000000f70ee0
PKRU: 55555554
Call Trace:
 <TASK>
 xfs_bmap_btalloc_filestreams fs/xfs/libxfs/xfs_bmap.c:3558 [inline]
 xfs_bmap_btalloc+0x706/0xb90 fs/xfs/libxfs/xfs_bmap.c:3672
 xfs_bmap_alloc_userdata fs/xfs/libxfs/xfs_bmap.c:4046 [inline]
 xfs_bmapi_allocate+0x25b/0x5e0 fs/xfs/libxfs/xfs_bmap.c:4089
 xfs_bmapi_convert_delalloc+0x335/0x6c0 fs/xfs/libxfs/xfs_bmap.c:4554
 xfs_convert_blocks fs/xfs/xfs_aops.c:266 [inline]
 xfs_map_blocks+0x2ff/0x8a0 fs/xfs/xfs_aops.c:389
 iomap_writepage_map fs/iomap/buffered-io.c:1641 [inline]
 iomap_do_writepage+0x43f/0x1070 fs/iomap/buffered-io.c:1803
 write_cache_pages+0x2b8/0x8a0 mm/page-writeback.c:2473
 iomap_writepages+0x3e/0x80 fs/iomap/buffered-io.c:1820
 xfs_vm_writepages+0x97/0xe0 fs/xfs/xfs_aops.c:513
 do_writepages+0x10f/0x240 mm/page-writeback.c:2551
 __writeback_single_inode+0x9f/0xb20 fs/fs-writeback.c:1600
 writeback_sb_inodes+0x301/0x8b0 fs/fs-writeback.c:1891
 wb_writeback+0x18b/0x7c0 fs/fs-writeback.c:2065
 wb_do_writeback fs/fs-writeback.c:2208 [inline]
 wb_workfn+0xc0/0xad0 fs/fs-writeback.c:2248
 process_one_work+0x3b1/0x9e0 kernel/workqueue.c:2390
 worker_thread+0x52/0x660 kernel/workqueue.c:2537
 kthread+0x161/0x1a0 kernel/kthread.c:376
 ret_from_fork+0x29/0x50 arch/x86/entry/entry_64.S:308
 </TASK>
Modules linked in:
CR2: 0000000000000010
---[ end trace 0000000000000000 ]---
RIP: 0010:arch_atomic_inc arch/x86/include/asm/atomic.h:95 [inline]
RIP: 0010:atomic_inc include/linux/atomic/atomic-instrumented.h:191 [inline]
RIP: 0010:xfs_filestream_create_association fs/xfs/xfs_filestream.c:321 [inline]
RIP: 0010:xfs_filestream_select_ag+0x5d5/0xce0 fs/xfs/xfs_filestream.c:372
Code: 80 ff 49 89 5d 18 be 08 00 00 00 bf 20 00 00 00 e8 80 f9 03 00 48 89 c3 48 85 c0 0f 84 3a 05 00 00 e8 9f 8a 80 ff 49 8b 45 18 <f0> ff 40 10 49 8b 45 18 48 8b 75 b8 48 89 da 48 89 43 18 48 8b 45
RSP: 0018:ffffc900001274c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88800dbeae40 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88800791a340 RDI: 0000000000000002
RBP: ffffc90000127548 R08: ffffc90000127400 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffc90000127588 R14: 0000000000000001 R15: ffffc90000127708
FS:  0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000000b85c002 CR4: 0000000000f70ee0
PKRU: 55555554
note: kworker/u4:2[34] exited with irqs disabled
------------[ cut here ]------------
WARNING: CPU: 1 PID: 34 at kernel/exit.c:814 do_exit+0xf68/0x1360 kernel/exit.c:814
Modules linked in:
CPU: 1 PID: 34 Comm: kworker/u4:2 Tainted: G      D            6.3.0-rc2-intel-next-38f821ff82e9+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-7:0)
RIP: 0010:do_exit+0xf68/0x1360 kernel/exit.c:814
Code: ff ff e8 2b 7e 1b 00 4c 89 ee bf 05 06 00 00 e8 7e c1 01 00 e9 a7 f2 ff ff e8 14 7e 1b 00 0f 0b e9 f8 f0 ff ff e8 08 7e 1b 00 <0f> 0b e9 60 f1 ff ff e8 fc 7d 1b 00 48 89 df e8 54 ff 1a 00 e9 ec
RSP: 0018:ffffc90000127eb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88800791a340 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff88800791a340 RDI: 0000000000000002
RBP: ffffc90000127f18 R08: 0000000000000000 R09: 0000000000000000
R10: 34752f72656b726f R11: 776b203a65746f6e R12: 0000000000000000
R13: 0000000000000009 R14: ffff8880079292c0 R15: ffff888007924600
FS:  0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000000b85c002 CR4: 0000000000f70ee0
PKRU: 55555554
Call Trace:
 <TASK>
 make_task_dead+0x100/0x290 kernel/exit.c:981
 rewind_stack_and_make_dead+0x17/0x20 arch/x86/entry/entry_64.S:1541
 </TASK>
irq event stamp: 46556
hardirqs last  enabled at (46555): [<ffffffff8218402d>] get_random_u32+0x1dd/0x360 drivers/char/random.c:532
hardirqs last disabled at (46556): [<ffffffff8300582e>] exc_page_fault+0x4e/0x500 arch/x86/mm/fault.c:1551
softirqs last  enabled at (37844): [<ffffffff83029bdc>] softirq_handle_end kernel/softirq.c:414 [inline]
softirqs last  enabled at (37844): [<ffffffff83029bdc>] __do_softirq+0x31c/0x49c kernel/softirq.c:600
softirqs last disabled at (37835): [<ffffffff8112e774>] invoke_softirq kernel/softirq.c:445 [inline]
softirqs last disabled at (37835): [<ffffffff8112e774>] __irq_exit_rcu kernel/softirq.c:650 [inline]
softirqs last disabled at (37835): [<ffffffff8112e774>] irq_exit_rcu+0xc4/0x100 kernel/softirq.c:662
---[ end trace 0000000000000000 ]---
----------------
Code disassembly (best guess):
   0:   80 ff 49                cmp    $0x49,%bh
   3:   89 5d 18                mov    %ebx,0x18(%rbp)
   6:   be 08 00 00 00          mov    $0x8,%esi
   b:   bf 20 00 00 00          mov    $0x20,%edi
  10:   e8 80 f9 03 00          call   0x3f995
  15:   48 89 c3                mov    %rax,%rbx
  18:   48 85 c0                test   %rax,%rax
  1b:   0f 84 3a 05 00 00       je     0x55b
  21:   e8 9f 8a 80 ff          call   0xff808ac5
  26:   49 8b 45 18             mov    0x18(%r13),%rax
* 2a:   f0 ff 40 10             lock incl 0x10(%rax) <-- trapping instruction
  2e:   49 8b 45 18             mov    0x18(%r13),%rax
  32:   48 8b 75 b8             mov    -0x48(%rbp),%rsi
  36:   48 89 da                mov    %rbx,%rdx
  39:   48 89 43 18             mov    %rax,0x18(%rbx)
  3d:   48                      rex.W
  3e:   8b                      .byte 0x8b
  3f:   45                      rex.RB
"

> > Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/kconfig_origin
> > v6.3-rc3 issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/v6.3-rc3_issue_dmesg.log
> > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230319_210525_xfs_filestream_select_ag/bisect_info.log
> > 
> > Bisected between v6.3-rc2 and v5.11 and found the bad commit:
> > "
> > 8ac5b996bf5199f15b7687ceae989f8b2a410dda
> > xfs: fix off-by-one-block in xfs_discard_folio()
> 
> How does *fixing* an off by one error in the page cache produce a crash
> in the filestreams allocator?
> 
  I'm also surprised there is such a problem. I'm not sure of the reason, as
  I know little about xfs.

> > Reverted the commit on top of v6.3-rc2 kernel, at least the BUG dmesg was gone.
> > 
> > And this issue could be reproduced in v6.3-rc3 kernel also.
> > Is it possible that the above commit involves a new issue?
> > 
> > "
> > [   62.318653] loop0: detected capacity change from 0 to 65536
> > [   62.320459] XFS (loop0): Mounting V5 Filesystem d6f69dbd-8c5d-46be-b88e-92c0ae88ceb2
> > [   62.325152] XFS (loop0): Ending clean mount
> > [   62.326049] XFS (loop0): Quotacheck needed: Please wait.
> > [   62.328884] XFS (loop0): Quotacheck: Done.
> > [   62.363656] XFS (loop0): Metadata CRC error detected at xfs_agf_read_verify+0x10e/0x140, xfs_agf block 0x8001 
> > [   62.364489] XFS (loop0): Unmount and run xfs_repair
> > [   62.364881] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > [   62.365398] 00000000: 58 41 47 46 00 00 00 01 00 00 00 01 00 00 40 00  XAGF..........@.
> > [   62.366026] 00000010: 00 00 00 02 00 00 00 03 00 00 00 00 00 00 00 01  ................
> > [   62.366657] 00000020: 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 04  ................
> > [   62.367285] 00000030: 00 00 00 04 00 00 3b 5f 00 00 3b 5c 00 00 00 00  ......;_..;\....
> > [   62.367927] 00000040: d6 f6 9d bd 8c 5d 46 be b8 8e 92 c0 ae 88 ce b2  .....]F.........
> > [   62.368554] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > [   62.369180] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > [   62.369806] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > [   62.370471] XFS (loop0): metadata I/O error in "xfs_read_agf+0xd0/0x200" at daddr 0x8001 len 1 error 74
> > [   62.371312] XFS (loop0): page discard on page 00000000a6a1237b, inode 0x46, pos 0.
> > [   62.385968] BUG: kernel NULL pointer dereference, address: 0000000000000010
> > [   62.386541] #PF: supervisor write access in kernel mode
> > [   62.386960] #PF: error_code(0x0002) - not-present page
> > [   62.387370] PGD 0 P4D 0 
> > [   62.387588] Oops: 0002 [#1] PREEMPT SMP NOPTI
> > [   62.387945] CPU: 1 PID: 74 Comm: kworker/u4:3 Not tainted 6.3.0-rc3-kvm-e8d018dd #1
> > [   62.388545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [   62.389426] Workqueue: writeback wb_workfn (flush-7:0)
> > [   62.389845] RIP: 0010:xfs_filestream_select_ag+0x5d5/0xac0
> 
> What source line and/or instruction does %rip point to?
> Considering that this is a null pointer deference, you ought to be able
> to identify which pointer access did this.
> 
> If you are going to run some scripted tool to randomly corrupt the
> filesystem to find failures, then you have an ethical and moral
> responsibility to do some of the work to narrow down and identify the
> cause of the failure, not just throw them at someone to do all the work.
> 
 You are right, sorry. I should provide the RIP and all other detailed info I
have next time.
 The info below is from the repro.report above:
"
BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0 
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 34 Comm: kworker/u4:2 Not tainted 6.3.0-rc2-intel-next-38f821ff82e9+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-7:0)
RIP: 0010:arch_atomic_inc arch/x86/include/asm/atomic.h:95 [inline]
RIP: 0010:atomic_inc include/linux/atomic/atomic-instrumented.h:191 [inline]
RIP: 0010:xfs_filestream_create_association fs/xfs/xfs_filestream.c:321 [inline]
RIP: 0010:xfs_filestream_select_ag+0x5d5/0xce0 fs/xfs/xfs_filestream.c:372
"

Thanks!
BR.
-Pengfei
> --D
>