On 17.12.24 04:47, xiyan@xxxxxxxx wrote:
Hello RDMA Community,

While testing the RoCEv2 feature of the Lustre file system, we encountered a crash issue related to ARP updates. Preliminary analysis suggests that this issue may be kernel-related, and it is also observed in an NVMe-oF environment. We are eager to receive your assistance. The detailed information regarding issue LU-18364 is below. Thanks.

The Lustre client and server are deployed inside VMs; the VMs use the network card in PF pass-through mode.

【OS】
VM Version: qemu-kvm-7.0.0
OS Version: Rocky 8.10
Kernel Version: 4.18.0-553.el8_10.x86_64

【Network Card】
Client: MLX CX6 1*100G RoCE v2
        MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
        firmware-version: 22.35.2000 (MT_0000000359)
Server: MLX CX6 1*100G RoCE v2
        MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
        firmware-version: 22.35.2000 (MT_0000000359)

【Kernel Commit】
[PATCH rdma-next v2 2/2] RDMA/core: Add a netevent notifier to cma - Leon Romanovsky
https://lore.kernel.org/all/bb255c9e301cd50b905663b8e73f7f5133d0e4c5.1654601342.git.leonro@xxxxxxxxxx/

【Lustre Issue】
LU-18364: https://jira.whamcloud.com/browse/LU-18364
LU-18275: https://jira.whamcloud.com/browse/LU-18275

【Problem Reproduction Steps】
We have found stable reproduction steps for the crash issue:
1. Use only one network card; do not use bonding.
2. Use vdbench to run a read/write test case on the Lustre client.
3. Construct an ARP update for a Lustre server IP address on the Lustre client. For example, the Lustre client IP is 192.168.122.220 and the Lustre server IP is 192.168.122.115, so run "arp -s 192.168.122.115 10:71:fc:69:92:b8 && arp -d 192.168.122.115" on 192.168.122.220; 10:71:fc:69:92:b8 is a wrong MAC address.

The crash stack is below:

      KERNEL: /usr/lib/debug/lib/modules/4.18.0-553.el8_10.x86_64/vmlinux [TAINTED]
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 20
        DATE: Tue Dec 3 14:58:41 CST 2024
      UPTIME: 00:06:20
LOAD AVERAGE: 10.14, 2.56, 0.86
       TASKS: 1076
    NODENAME: rocky8vm3
     RELEASE: 4.18.0-553.el8_10.x86_64
     VERSION: #1 SMP Fri May 24 13:05:10 UTC 2024
     MACHINE: x86_64 (2599 Mhz)
      MEMORY: 31.4 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
         PID: 607
     COMMAND: "kworker/u40:28"
        TASK: ff1e34360b6e0000 [THREAD_INFO: ff1e34360b6e0000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 607  TASK: ff1e34360b6e0000  CPU: 1  COMMAND: "kworker/u40:28"
 #0 [ff4de14b444cbbc0] machine_kexec at ffffffff9c46f2d3
 #1 [ff4de14b444cbc18] __crash_kexec at ffffffff9c5baa5a
 #2 [ff4de14b444cbcd8] crash_kexec at ffffffff9c5bb991
 #3 [ff4de14b444cbcf0] oops_end at ffffffff9c42d811
 #4 [ff4de14b444cbd10] no_context at ffffffff9c481cf3
 #5 [ff4de14b444cbd68] __bad_area_nosemaphore at ffffffff9c48206c
 #6 [ff4de14b444cbdb0] do_page_fault at ffffffff9c482cf7
 #7 [ff4de14b444cbde0] page_fault at ffffffff9d0011ae
    [exception RIP: process_one_work+46]
    RIP: ffffffff9c51944e  RSP: ff4de14b444cbe98  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ff1e34360734b5d8  RCX: dead000000000200
    RDX: 000000010001393f  RSI: ff1e34360734b5d8  RDI: ff1e343ca7eed5c0
    RBP: ff1e343600019400  R8: ff1e343d37c73bb8   R9: 0000005885358800
    R10: 0000000000000000  R11: ff1e343d37c71dc4  R12: 0000000000000000
    R13: ff1e343600019420  R14: ff1e3436000194d0  R15: ff1e343ca7eed5c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ff4de14b444cbed8] worker_thread at ffffffff9c5197e0
 #9 [ff4de14b444cbf10] kthread at ffffffff9c520e04
#10 [ff4de14b444cbf50] ret_from_fork at ffffffff9d00028f

Another stack is below:

[ 1656.060089] list_del corruption. next->prev should be ff4880c9d81b8d48, but was ff4880ccfb2d45e0
It seems to be a memory corruption problem. The root cause of this kind of memory corruption can be very complicated to track down. Since you can reproduce the problem, perhaps some eBPF tools can help you find the root cause.
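For example, a minimal bpftrace sketch is below. It assumes the netevent notifier from the patch referenced above is present in this kernel and that the handler symbols are named cma_netevent_callback and cma_netevent_work_handler (please verify the names in /proc/kallsyms, since MOFED builds may differ). It only prints timestamps, so that bursts of neighbour-update callbacks can be correlated with the queued CM work items:

  bpftrace -e '
  kprobe:cma_netevent_callback {
      /* a neighbour (ARP) update reached the RDMA CM netevent notifier */
      printf("%llu ns: cma_netevent_callback on CPU %d\n", nsecs, cpu);
  }
  kprobe:cma_netevent_work_handler {
      /* the work item queued by the callback above is now running */
      printf("%llu ns: cma_netevent_work_handler on CPU %d\n", nsecs, cpu);
  }'

If several callback lines appear for the same neighbour before the corresponding work lines, that would point at the work item being re-queued while still pending.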
Zhu Yanjun
[ 1656.060536] ------------[ cut here ]------------
[ 1656.060538] kernel BUG at lib/list_debug.c:56!
[ 1656.060738] invalid opcode: 0000 [#1] SMP NOPTI
[ 1656.060872] CPU: 4 PID: 606 Comm: kworker/u40:27 Kdump: loaded Tainted: GF OE -------- - - 4.18.0-553.el8_10.x86_64 #1
[ 1656.061130] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+1408+7b966129 04/01/2014
[ 1656.061261] Workqueue: mlx5_cmd_0000:11:00.0 cmd_work_handler [mlx5_core]
[ 1656.061457] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x48
[ 1656.061586] Code: 45 d4 99 e8 5e 52 c7 ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 00 46 d4 99 e8 4a 52 c7 ff 0f 0b 48 c7 c7 b0 46 d4 99 e8 3c 52 c7 ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 70 46 d4 99 e8 28 52 c7 ff 0f 0b
[ 1656.061846] RSP: 0018:ff650559444dfe90 EFLAGS: 00010046
[ 1656.061974] RAX: 0000000000000054 RBX: ff4880c9d81b8d40 RCX: 0000000000000000
[ 1656.062103] RDX: 0000000000000000 RSI: ff4880cf9731e698 RDI: ff4880cf9731e698
[ 1656.062238] RBP: ff4880c840019400 R08: 0000000000000000 R09: c0000000ffff7fff
[ 1656.062366] R10: 0000000000000001 R11: ff650559444dfcb0 R12: ff4880c862647b00
[ 1656.062492] R13: ff4880c879326540 R14: 0000000000000000 R15: ff4880c9d81b8d48
[ 1656.062619] FS: 0000000000000000(0000) GS:ff4880cf97300000(0000) knlGS:0000000000000000
[ 1656.062745] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1656.062868] CR2: 000055cc1af6b000 CR3: 000000084b610006 CR4: 0000000000771ee0
[ 1656.062996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1656.063127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1656.063250] PKRU: 55555554

      KERNEL: /usr/lib/debug/lib/modules/4.18.0-553.el8_10.x86_64/vmlinux [TAINTED]
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 20
        DATE: Fri Nov 29 17:37:31 CST 2024
      UPTIME: 00:27:35
LOAD AVERAGE: 350.47, 237.79, 163.91
       TASKS: 1106
    NODENAME: rocky8vm3
     RELEASE: 4.18.0-553.el8_10.x86_64
     VERSION: #1 SMP Fri May 24 13:05:10 UTC 2024
     MACHINE: x86_64 (2599 Mhz)
      MEMORY: 31.4 GB
       PANIC: "kernel BUG at lib/list_debug.c:56!"
         PID: 606
     COMMAND: "kworker/u40:27"
        TASK: ff4880c8793f8000 [THREAD_INFO: ff4880c8793f8000]
         CPU: 4
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 606  TASK: ff4880c8793f8000  CPU: 4  COMMAND: "kworker/u40:27"
 #0 [ff650559444dfc28] machine_kexec at ffffffff98a6f2d3
 #1 [ff650559444dfc80] __crash_kexec at ffffffff98bbaa5a
 #2 [ff650559444dfd40] crash_kexec at ffffffff98bbb991
 #3 [ff650559444dfd58] oops_end at ffffffff98a2d811
 #4 [ff650559444dfd78] do_trap at ffffffff98a29a27
 #5 [ff650559444dfdc0] do_invalid_op at ffffffff98a2a766
 #6 [ff650559444dfde0] invalid_op at ffffffff99600da4
    [exception RIP: __list_del_entry_valid.cold.1+32]
    RIP: ffffffff98ef8f98  RSP: ff650559444dfe90  RFLAGS: 00010046
    RAX: 0000000000000054  RBX: ff4880c9d81b8d40  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ff4880cf9731e698  RDI: ff4880cf9731e698
    RBP: ff4880c840019400  R8: 0000000000000000   R9: c0000000ffff7fff
    R10: 0000000000000001  R11: ff650559444dfcb0  R12: ff4880c862647b00
    R13: ff4880c879326540  R14: 0000000000000000  R15: ff4880c9d81b8d48
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ff650559444dfe90] process_one_work at ffffffff98b19557
 #8 [ff650559444dfed8] worker_thread at ffffffff98b197e0
 #9 [ff650559444dff10] kthread at ffffffff98b20e04
#10 [ff650559444dff50] ret_from_fork at ffffffff9960028f

This bug seems to be in the rdma_cm module on the MOFED/kernel side, so we tried to reproduce the crash on an NVMe-oF node:
1. Connect the NVMe-oF disk: run "nvme connect -t rdma -n nqn.2014-08.org.nvmexpress:67240ebd3fa63ca3 -a 192.168.122.30 -s 4421".
2. Use dd to run a write/read test case, for example "dd if=/dev/nvme0n17 of=./test bs=32K count=102400 oflag=direct".
3. Construct an ARP update: run "arp -s 192.168.122.112 10:71:fe:69:93:b8 && arp -d 192.168.122.112" on the NVMe-oF client.
4. The crash is reproduced.

The issue may involve the following key points:
1. The RDMA module receives multiple network events simultaneously.
2. We have observed that a normal ARP update may generate one or more events, which makes this issue probabilistic.
3. When an ARP update event and a connection termination (disconnect) event are received at the same time, issue LU-18275 is triggered.
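If it helps with reproduction, a minimal stress-loop sketch is below. It simply repeats the ARP flap from step 3 (using the same example IP and deliberately wrong MAC) while the dd workload from step 2 keeps I/O, and therefore RDMA connections, active, so that neighbour-update events and disconnect events are more likely to overlap; the 0.1 s interval is an arbitrary choice.

  # Sketch: flap the neighbour entry of the NVMe-oF target in a loop while
  # dd is running. IP/MAC are the example values from step 3; the sleep
  # interval is arbitrary and only meant to widen the race window.
  while true; do
      arp -s 192.168.122.112 10:71:fe:69:93:b8   # deliberately wrong MAC
      arp -d 192.168.122.112
      sleep 0.1
  done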