On 17.12.24 04:47, xiyan@xxxxxxxx wrote:
Hello RDMA Community,

While testing the RoCEv2 feature of the Lustre file system, we encountered a crash issue related to ARP updates. Preliminary analysis suggests that this issue may be kernel-related, and it is also observed in an NVMe-oF environment. We are eager to receive your assistance. The detailed information regarding issue LU-18364 is below. Thanks.

The Lustre client and server are deployed inside VMs; the VMs use the network card in PF pass-through mode.

【OS】
VM Version: qemu-kvm-7.0.0
OS Version: Rocky 8.10
Kernel Version: 4.18.0-553.el8_10.x86_64

【Network Card】
Client: MLX CX6 1*100G RoCE v2
        MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
        firmware-version: 22.35.2000 (MT_0000000359)
Server: MLX CX6 1*100G RoCE v2
        MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
        firmware-version: 22.35.2000 (MT_0000000359)

【Kernel Commit】
[PATCH rdma-next v2 2/2] RDMA/core: Add a netevent notifier to cma - Leon Romanovsky
https://lore.kernel.org/all/bb255c9e301cd50b905663b8e73f7f5133d0e4c5.1654601342.git.leonro@xxxxxxxxxx/

【Lustre Issue】
LU-18364: https://jira.whamcloud.com/browse/LU-18364
LU-18275: https://jira.whamcloud.com/browse/LU-18275

【Problem Reproduction Steps】
We have found stable reproduction steps for the crash issue:
1. Use only one network card; do not use bonding.
2. Use vdbench to run a read/write test case on the Lustre client.
3. Construct an ARP update for a Lustre server IP address on the Lustre client. For example, the Lustre client IP is 192.168.122.220 and the Lustre server IP is 192.168.122.115, so run "arp -s 192.168.122.115 10:71:fc:69:92:b8 && arp -d 192.168.122.115" on 192.168.122.220; 10:71:fc:69:92:b8 is a wrong MAC address.

The crash stack is below:

      KERNEL: /usr/lib/debug/lib/modules/4.18.0-553.el8_10.x86_64/vmlinux [TAINTED]
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 20
        DATE: Tue Dec 3 14:58:41 CST 2024
      UPTIME: 00:06:20
LOAD AVERAGE: 10.14, 2.56, 0.86
       TASKS: 1076
    NODENAME: rocky8vm3
     RELEASE: 4.18.0-553.el8_10.x86_64
     VERSION: #1 SMP Fri May 24 13:05:10 UTC 2024
     MACHINE: x86_64 (2599 Mhz)
      MEMORY: 31.4 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
         PID: 607
     COMMAND: "kworker/u40:28"
        TASK: ff1e34360b6e0000 [THREAD_INFO: ff1e34360b6e0000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 607  TASK: ff1e34360b6e0000  CPU: 1  COMMAND: "kworker/u40:28"
 #0 [ff4de14b444cbbc0] machine_kexec at ffffffff9c46f2d3
 #1 [ff4de14b444cbc18] __crash_kexec at ffffffff9c5baa5a
 #2 [ff4de14b444cbcd8] crash_kexec at ffffffff9c5bb991
 #3 [ff4de14b444cbcf0] oops_end at ffffffff9c42d811
 #4 [ff4de14b444cbd10] no_context at ffffffff9c481cf3
 #5 [ff4de14b444cbd68] __bad_area_nosemaphore at ffffffff9c48206c
 #6 [ff4de14b444cbdb0] do_page_fault at ffffffff9c482cf7
 #7 [ff4de14b444cbde0] page_fault at ffffffff9d0011ae
    [exception RIP: process_one_work+46]
    RIP: ffffffff9c51944e  RSP: ff4de14b444cbe98  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ff1e34360734b5d8  RCX: dead000000000200
    RDX: 000000010001393f  RSI: ff1e34360734b5d8  RDI: ff1e343ca7eed5c0
    RBP: ff1e343600019400  R8: ff1e343d37c73bb8   R9: 0000005885358800
    R10: 0000000000000000  R11: ff1e343d37c71dc4  R12: 0000000000000000
    R13: ff1e343600019420  R14: ff1e3436000194d0  R15: ff1e343ca7eed5c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ff4de14b444cbed8] worker_thread at ffffffff9c5197e0
 #9 [ff4de14b444cbf10] kthread at ffffffff9c520e04
#10 [ff4de14b444cbf50] ret_from_fork at ffffffff9d00028f

Another stack is below:

[ 1656.060089] list_del corruption. next->prev should be ff4880c9d81b8d48, but was ff4880ccfb2d45e0
It seems to be a memory corruption problem. The root cause of this kind of memory corruption can be very complicated to track down. Since you can reproduce the problem, perhaps some eBPF tools can help you find the root cause.
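For example, a minimal bpftrace sketch is below. It assumes the netevent notifier from the patch referenced above is present in this kernel and that the handler symbols are named cma_netevent_callback and cma_netevent_work_handler (please verify the names in /proc/kallsyms, since MOFED builds may differ). It only prints timestamps, so that bursts of neighbour-update callbacks can be correlated with the queued CM work items:

  bpftrace -e '
  kprobe:cma_netevent_callback {
      /* a neighbour (ARP) update reached the RDMA CM netevent notifier */
      printf("%llu ns: cma_netevent_callback on CPU %d\n", nsecs, cpu);
  }
  kprobe:cma_netevent_work_handler {
      /* the work item queued by the callback above is now running */
      printf("%llu ns: cma_netevent_work_handler on CPU %d\n", nsecs, cpu);
  }'

If several callback lines appear for the same neighbour before the corresponding work lines, that would point at the work item being re-queued while still pending.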
Zhu Yanjun
[ 1656.060536] ------------[ cut here ]------------
[ 1656.060538] kernel BUG at lib/list_debug.c:56!
[ 1656.060738] invalid opcode: 0000 [#1] SMP NOPTI
[ 1656.060872] CPU: 4 PID: 606 Comm: kworker/u40:27 Kdump: loaded Tainted: GF OE -------- - - 4.18.0-553.el8_10.x86_64 #1
[ 1656.061130] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+1408+7b966129 04/01/2014
[ 1656.061261] Workqueue: mlx5_cmd_0000:11:00.0 cmd_work_handler [mlx5_core]
[ 1656.061457] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x48
[ 1656.061586] Code: 45 d4 99 e8 5e 52 c7 ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 00 46 d4 99 e8 4a 52 c7 ff 0f 0b 48 c7 c7 b0 46 d4 99 e8 3c 52 c7 ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 70 46 d4 99 e8 28 52 c7 ff 0f 0b
[ 1656.061846] RSP: 0018:ff650559444dfe90 EFLAGS: 00010046
[ 1656.061974] RAX: 0000000000000054 RBX: ff4880c9d81b8d40 RCX: 0000000000000000
[ 1656.062103] RDX: 0000000000000000 RSI: ff4880cf9731e698 RDI: ff4880cf9731e698
[ 1656.062238] RBP: ff4880c840019400 R08: 0000000000000000 R09: c0000000ffff7fff
[ 1656.062366] R10: 0000000000000001 R11: ff650559444dfcb0 R12: ff4880c862647b00
[ 1656.062492] R13: ff4880c879326540 R14: 0000000000000000 R15: ff4880c9d81b8d48
[ 1656.062619] FS: 0000000000000000(0000) GS:ff4880cf97300000(0000) knlGS:0000000000000000
[ 1656.062745] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1656.062868] CR2: 000055cc1af6b000 CR3: 000000084b610006 CR4: 0000000000771ee0
[ 1656.062996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1656.063127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1656.063250] PKRU: 55555554

      KERNEL: /usr/lib/debug/lib/modules/4.18.0-553.el8_10.x86_64/vmlinux [TAINTED]
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 20
        DATE: Fri Nov 29 17:37:31 CST 2024
      UPTIME: 00:27:35
LOAD AVERAGE: 350.47, 237.79, 163.91
       TASKS: 1106
    NODENAME: rocky8vm3
     RELEASE: 4.18.0-553.el8_10.x86_64
     VERSION: #1 SMP Fri May 24 13:05:10 UTC 2024
     MACHINE: x86_64 (2599 Mhz)
      MEMORY: 31.4 GB
       PANIC: "kernel BUG at lib/list_debug.c:56!"
         PID: 606
     COMMAND: "kworker/u40:27"
        TASK: ff4880c8793f8000 [THREAD_INFO: ff4880c8793f8000]
         CPU: 4
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 606  TASK: ff4880c8793f8000  CPU: 4  COMMAND: "kworker/u40:27"
 #0 [ff650559444dfc28] machine_kexec at ffffffff98a6f2d3
 #1 [ff650559444dfc80] __crash_kexec at ffffffff98bbaa5a
 #2 [ff650559444dfd40] crash_kexec at ffffffff98bbb991
 #3 [ff650559444dfd58] oops_end at ffffffff98a2d811
 #4 [ff650559444dfd78] do_trap at ffffffff98a29a27
 #5 [ff650559444dfdc0] do_invalid_op at ffffffff98a2a766
 #6 [ff650559444dfde0] invalid_op at ffffffff99600da4
    [exception RIP: __list_del_entry_valid.cold.1+32]
    RIP: ffffffff98ef8f98  RSP: ff650559444dfe90  RFLAGS: 00010046
    RAX: 0000000000000054  RBX: ff4880c9d81b8d40  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ff4880cf9731e698  RDI: ff4880cf9731e698
    RBP: ff4880c840019400  R8: 0000000000000000   R9: c0000000ffff7fff
    R10: 0000000000000001  R11: ff650559444dfcb0  R12: ff4880c862647b00
    R13: ff4880c879326540  R14: 0000000000000000  R15: ff4880c9d81b8d48
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ff650559444dfe90] process_one_work at ffffffff98b19557
 #8 [ff650559444dfed8] worker_thread at ffffffff98b197e0
 #9 [ff650559444dff10] kthread at ffffffff98b20e04
#10 [ff650559444dff50] ret_from_fork at ffffffff9960028f

This bug seems to be in the rdma_cm module on the MOFED/kernel side, so we tried to reproduce the crash on an NVMe-oF node:
1. Connect the NVMe-oF disk: run "nvme connect -t rdma -n nqn.2014-08.org.nvmexpress:67240ebd3fa63ca3 -a 192.168.122.30 -s 4421".
2. Use dd to run a write/read test case, for example "dd if=/dev/nvme0n17 of=./test bs=32K count=102400 oflag=direct".
3. Construct an ARP update: run "arp -s 192.168.122.112 10:71:fe:69:93:b8 && arp -d 192.168.122.112" on the NVMe-oF client.
4. The crash is reproduced.

The issue may involve the following key points:
1. The RDMA module receives multiple network events simultaneously.
2. We have observed that a normal ARP update may generate one or more events, which makes this issue probabilistic.
3. When an ARP update event and a connection termination (disconnect) event are received at the same time, issue LU-18275 is triggered.
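If it helps with reproduction, a minimal stress-loop sketch is below. It simply repeats the ARP flap from step 3 (using the same example IP and deliberately wrong MAC) while the dd workload from step 2 keeps I/O, and therefore RDMA connections, active, so that neighbour-update events and disconnect events are more likely to overlap; the 0.1 s interval is an arbitrary choice.

  # Sketch: flap the neighbour entry of the NVMe-oF target in a loop while
  # dd is running. IP/MAC are the example values from step 3; the sleep
  # interval is arbitrary and only meant to widen the race window.
  while true; do
      arp -s 192.168.122.112 10:71:fe:69:93:b8   # deliberately wrong MAC
      arp -d 192.168.122.112
      sleep 0.1
  done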