Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Mon, 27 Feb 2023 06:59:02 -0800

On Mon, Feb 27, 2023 at 08:15:26AM -0500, Joel Fernandes wrote:
> 
> 
> > On Feb 27, 2023, at 3:03 AM, Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote:
> > 
> > Hi
> > 
> >> On Mon, Feb 27, 2023 at 2:30 PM Sanan Hasanov
> >> <sanan.hasanov@xxxxxxxxxxxxxxx> wrote:
> >> 
> >> Good day, dear maintainers,
> >> 
> >> We found a bug using a modified kernel configuration file used by syzbot.
> >> 
> >> We enhanced the coverage of the configuration file using our tool, klocalizer.
> >> 
> >> Kernel Branch: 6.2.0-next-20230221
> >> Kernel config: https://drive.google.com/file/d/1QKAQV11zjOwISifUc-skRBoTo3EXhutY/view?usp=share_link
> >> C Reproducer: Unfortunately, there is no reproducer yet.
> 
> Sanan/Zhouyi,
> Could you also provide the full kernel dmesg? Could you enable CONFIG_DEBUG_INFO_DWARF5 and provide the vmlinux after the crash?
> 
> More comments below:
> 
> >> 
> >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> #PF: supervisor instruction fetch in kernel mode
> >> #PF: error_code(0x0010) - not-present page
> >> PGD 53756067 P4D 53756067 PUD 0
> >> Oops: 0010 [#1] PREEMPT SMP KASAN
> >> CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.2.0-next-20230221 #1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> >> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
> >> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> >> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> >> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> >> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> >> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> >> FS:  0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> >> Call Trace:
> >> <IRQ>
> >> rcu_core+0x85d/0x1960
> >> __do_softirq+0x2e5/0xae2
> >> __irq_exit_rcu+0x11d/0x190
> >> irq_exit_rcu+0x9/0x20
> >> sysvec_apic_timer_interrupt+0x97/0xc0
> >> </IRQ>
> >> <TASK>
> >> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> RIP: 0010:default_idle+0xf/0x20
> >> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
> >> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
> >> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
> >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
> >> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
> >> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
> >> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
> >> default_idle_call+0x67/0xa0
> >> do_idle+0x361/0x440
> >> cpu_startup_entry+0x18/0x20
> >> start_secondary+0x256/0x300
> >> secondary_startup_64_no_verify+0xce/0xdb
> >> </TASK>
> >> Modules linked in:
> >> CR2: 0000000000000000
> >> ---[ end trace 0000000000000000 ]---
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> 
> I have seen this exact signature when the processor tries to call a function whose address is NULL. That sends the instruction pointer to 0, and the instruction fetch there raises the exception. It sounds like something corrupted an rcu_head (just a guess).

Quite possibly!  If so, then building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
might be helpful.

Once a reproducer is found, of course...
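
In the meantime, the "#PF" lines in the report can be cross-checked by
hand against Joel's reading. A minimal sketch, assuming the usual x86
page-fault error-code layout for the low five bits; the helper name
decode_pf_error_code() is made up for illustration and is not a kernel
interface:

```python
# Low five bits of the x86 page-fault error code.
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]

def decode_pf_error_code(code):
    """Hypothetical helper: describe the low bits of a #PF error code."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            out.append(text)
    return out

# error_code(0x0010) from the report
print(decode_pf_error_code(0x0010))
```

For 0x0010 this lists a not-present page, supervisor mode, and an
instruction fetch, which fits a call through a NULL function pointer.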

							Thanx, Paul

> >> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
> >> 
> >> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> >> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> >> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> >> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> >> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> >> FS:  0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> >> ----------------
> >> Code disassembly (best guess):
> >>   0:   89 07                   mov    %eax,(%rdi)
> >>   2:   49 c7 c0 08 00 00 00    mov    $0x8,%r8
> >>   9:   4d 29 c8                sub    %r9,%r8
> >>   c:   4c 01 c7                add    %r8,%rdi
> >>   f:   4c 29 c2                sub    %r8,%rdx
> >>  12:   e9 76 ff ff ff          jmp    0xffffff8d
> >>  17:   cc                      int3
> >>  18:   cc                      int3
> >>  19:   cc                      int3
> >>  1a:   cc                      int3
> >>  1b:   f3 0f 1e fa             endbr64
> >>  1f:   eb 07                   jmp    0x28
> >>  21:   0f 00 2d e3 8a 34 00    verw   0x348ae3(%rip)        # 0x348b0b
> >>  28:   fb                      sti
> >>  29:   f4                      hlt
> >> * 2a:   fa                      cli <-- trapping instruction
> 
> This probably happened before the crash and is likely unrelated, IMO. cli just means interrupts were enabled; the actual problem happened after the softirq fired (likely at the tail end of the interrupt).
> 
> Thanks,
> 
>  - Joel 
> 
> 
> >>  2b:   c3                      ret
> >>  2c:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
> >>  33:   00 00 00 00
> >>  37:   0f 1f 40 00             nopl   0x0(%rax)
> >>  3b:   f3 0f 1e fa             endbr64
> >>  3f:   65                      gs
> >>
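
The "trapping instruction" marker in the disassembly above is derived
from the <fa> byte in the second "Code:" line, that is, the byte at the
interrupted RIP in default_idle. A small sketch of that bookkeeping,
assuming the angle-bracket convention seen in this report; trap_offset()
is a hypothetical helper, not an existing tool:

```python
def trap_offset(code_bytes):
    """Hypothetical helper: index of the <xx>-marked byte in a Code: dump."""
    for i, tok in enumerate(code_bytes.split()):
        if tok.startswith("<") and tok.endswith(">"):
            return i
    return None

# Byte dump copied from the default_idle "Code:" line in the report.
CODE = ("89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 "
        "ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 "
        "fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 "
        "f3 0f 1e fa 65")

offset = trap_offset(CODE)
print(hex(offset))                 # offset of the marked byte in the dump
print(hex((0 - offset) % 2**64))   # same offset subtracted from a RIP of 0
```

The marked byte sits at offset 0x2a, matching the "* 2a: fa cli" line.
Subtracting the same offset from a RIP of 0 gives 0xffffffffffffffd6,
which is consistent with the "Unable to access opcode bytes" address
reported for the faulting RIP 0x0.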