Hello,

I have identified a bug in the mlx5_core module, or some related component.

Doing the following on a freshly provisioned Oracle Cloud bare metal node with this configuration [0] will reliably cause the entire instance to become unresponsive:

rmmod mlx5_ib; rmmod mlx5_core; modprobe mlx5_core

This also produces the following output:

[ 331.267175] I/O error, dev sda, sector 35602992 op 0x0:(READ) flags 0x80700 phys_seg 33 prio class 0
[ 331.376575] I/O error, dev sda, sector 35600432 op 0x0:(READ) flags 0x84700 phys_seg 320 prio class 0
[ 331.487509] I/O error, dev sda, sector 35595064 op 0x0:(READ) flags 0x80700 phys_seg 159 prio class 0
[ 528.386085] INFO: task kworker/u290:0:453 blocked for more than 122 seconds.
[ 528.470497] Not tainted 6.14.0-rc1 #1
[ 528.520546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 528.615268] INFO: task kworker/u290:3:820 blocked for more than 123 seconds.
[ 528.699641] Not tainted 6.14.0-rc1 #1
[ 528.749690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 528.843577] INFO: task jbd2/sda1-8:1128 blocked for more than 123 seconds.
[ 528.925922] Not tainted 6.14.0-rc1 #1
[ 528.975971] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.069854] INFO: task systemd-journal:1218 blocked for more than 123 seconds.
[ 529.156382] Not tainted 6.14.0-rc1 #1
[ 529.206441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.300407] INFO: task kworker/u290:4:1828 blocked for more than 123 seconds.
[ 529.385892] Not tainted 6.14.0-rc1 #1
[ 529.435942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.529973] INFO: task rs:main Q:Reg:2184 blocked for more than 124 seconds.
[ 529.614607] Not tainted 6.14.0-rc1 #1
[ 529.664656] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.758690] INFO: task gomon:2258 blocked for more than 124 seconds.
[ 529.834832] Not tainted 6.14.0-rc1 #1
[ 529.884887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.978867] INFO: task kworker/u290:5:3255 blocked for more than 124 seconds.
[ 530.064351] Not tainted 6.14.0-rc1 #1
[ 530.114398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 651.265588] INFO: task kworker/u290:0:453 blocked for more than 245 seconds.
[ 651.349980] Not tainted 6.14.0-rc1 #1
[ 651.400028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 651.494126] INFO: task kworker/u290:3:820 blocked for more than 245 seconds.
[ 651.578543] Not tainted 6.14.0-rc1 #1
[ 651.628600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I tried using the function_graph tracer to check whether any functions within mlx5_core were executing for an excessive amount of time, but did not find anything conclusive.

Attached [1] is the stack trace that I see when I force the kernel to panic once a hang has been detected. I did this 3 times, and the traces were similar in that they all referred to ext4_* functions, which seems to line up with the I/O errors that I see each time.

I should also note that I was able to trigger a similar I/O error on a DGX A100 one time (running the Ubuntu-6.8.0-52-generic kernel with modules installed via a repackaged version of DOCA-OFED), but I have not been able to reliably reproduce this issue on that machine with the pure upstream inbox drivers, as I can on the OCI instance. (I was also still able to interact with the A100, but attempting to run any command resulted in a "command not found" error, which again lines up with the idea that this might have been interfering with ext4 somehow.)

Has anything like this been observed by other users? Please let me know if there is anything else I should do or provide to help debug this issue, or if there is already a known root cause.
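
In case it is useful to anyone trying to reproduce this, the three commands above can be wrapped in a small script. This is only a sketch: the DRY_RUN variable and the run/reload_mlx5 helper names are made up for illustration and are not part of any existing tool. It defaults to printing the commands rather than executing them, since running them for real is what hangs the instance here.

```shell
#!/bin/sh
# Hypothetical wrapper around the module reload sequence from this report.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command, or execute it when DRY_RUN=0.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Same order as in the report: unload mlx5_ib, unload mlx5_core,
# then load mlx5_core again.
reload_mlx5() {
    run rmmod mlx5_ib
    run rmmod mlx5_core
    run modprobe mlx5_core
}

# Usage: call reload_mlx5 to print the sequence. Set DRY_RUN=0 as root to
# actually run it, ideally with a serial console attached, because the
# instance may stop responding.
```

Calling reload_mlx5 with the default setting prints the three commands prefixed with "+ ", which makes it easy to confirm the order before running them on a machine you can afford to lose access to.
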

[0] System specs:
OCI bare-metal node, BM.Optimized3.36 shape with RoCE connectivity to another identical node
Kernel: mainline @ 6.14.0-rc1 with this config: https://pastebin.ubuntu.com/p/5Jm2WFZY62/
ibstat output: https://pastebin.ubuntu.com/p/S5dfFSdDxd/
lscpu output: https://pastebin.ubuntu.com/p/dfPyYQWnhX/

[1] https://pastebin.ubuntu.com/p/kxw2dsmwFV/

--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering