An issue of kernel stall caused by infiniband mad packet

Yacan Liu <liuyacan@xxxxxxxxxxxxxxxx> · Wed, 22 Feb 2023 20:40:52 +0800

Hi folks:

  We recently encountered a kernel stall issue related to infiniband 
mad packet. We believe this is caused by some abnormal operations of 
user-level programs.But I am not sure if the kernel driver needs to do 
some protection.

In our cluster, multiple nodes will establish multiple RDMA connections 
with each other. Then, when we start the program, the kernel prompts the 
following error after one minute:

[ 5489.530125] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5489.535714] rcu:     55-....: (19027 ticks this GP) 
idle=236/1/0x4000000000000000 softirq=13510/13510 fqs=4359
[ 5489.545554]  (t=21015 jiffies g=279845 q=701559)
[ 5489.545557] NMI backtrace for cpu 55
[ 5489.545559] CPU: 55 PID: 44979 Comm: kworker/u196:1 Tainted: G    OE 
   5.10.91-netease3-1 #2
[ 5489.545560] Hardware name: Inspur SA5212M5/YZMB-00882-10T, BIOS 
4.1.21 08/25/2021
[ 5489.545582] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[ 5489.545584] Call Trace:
[ 5489.545586]
[ 5489.545590]  dump_stack+0x6b/0x83
[ 5489.545593]  nmi_cpu_backtrace.cold+0x32/0x69
[ 5489.545596]  ? lapic_can_unplug_cpu+0x80/0x80
[ 5489.545599]  nmi_trigger_cpumask_backtrace+0xd7/0xe0
[ 5489.545602]  rcu_dump_cpu_stacks+0xa2/0xd0
[ 5489.545604]  rcu_sched_clock_irq.cold+0x1ff/0x3d6
[ 5489.545607]  ? trigger_load_balance+0x5a/0x240
[ 5489.545610]  update_process_times+0x8c/0xc0
[ 5489.545613]  tick_sched_handle+0x22/0x60
[ 5489.545615]  tick_sched_timer+0x7c/0xb0
[ 5489.545617]  ? tick_do_update_jiffies64.part.0+0xc0/0xc0
[ 5489.545619]  __hrtimer_run_queues+0x12a/0x270
[ 5489.545621]  hrtimer_interrupt+0x110/0x2c0
[ 5489.545623]  __sysvec_apic_timer_interrupt+0x5f/0xd0
[ 5489.545627]  asm_call_irq_on_stack+0xf/0x20
[ 5489.545627]
[ 5489.545629]  sysvec_apic_timer_interrupt+0x72/0x80
[ 5489.545631]  asm_sysvec_apic_timer_interrupt+0x12/0x20

OS:  Debian 11
Kernel: 5.10.91
ofed: MLNX_OFED_LINUX-5.5-1.0.3.2

PS. Thread worker/u196:2+ib-comp-unb-w is consuming 100% CPU.

After some investigations, We found that the kernel thread spends too 
much time in the function wait_for_response(). We know that in the 
process of establishing an RDMA connection, when the local side sends 
out a MAD request, it will wait for the response packet from the remote 
side within a limited time, and it is in the wait_list at this time. 
This wait_list is sorted according to the timeout time, which means the 
complexity of insertion or search is O(n). When the remote application 
does not call rdma_accept() for some reason, it will not send the 
response packet. In this way, the entry will not be cleared from the 
wait_list on the local side for a long time. And the local application 
continues to initiate connections outward, which makes the elements in 
the wait_list more and more, and the wait_for_response() longer and longer.

Regard,
Yacan