Hi folks:
We recently encountered a kernel stall issue related to infiniband
mad packet. We believe this is caused by some abnormal operations of
user-level programs.But I am not sure if the kernel driver needs to do
some protection.
In our cluster, multiple nodes will establish multiple RDMA connections
with each other. Then, when we start the program, the kernel prompts the
following error after one minute:
[ 5489.530125] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5489.535714] rcu: 55-....: (19027 ticks this GP)
idle=236/1/0x4000000000000000 softirq=13510/13510 fqs=4359
[ 5489.545554] (t=21015 jiffies g=279845 q=701559)
[ 5489.545557] NMI backtrace for cpu 55
[ 5489.545559] CPU: 55 PID: 44979 Comm: kworker/u196:1 Tainted: G OE
5.10.91-netease3-1 #2
[ 5489.545560] Hardware name: Inspur SA5212M5/YZMB-00882-10T, BIOS
4.1.21 08/25/2021
[ 5489.545582] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[ 5489.545584] Call Trace:
[ 5489.545586]
[ 5489.545590] dump_stack+0x6b/0x83
[ 5489.545593] nmi_cpu_backtrace.cold+0x32/0x69
[ 5489.545596] ? lapic_can_unplug_cpu+0x80/0x80
[ 5489.545599] nmi_trigger_cpumask_backtrace+0xd7/0xe0
[ 5489.545602] rcu_dump_cpu_stacks+0xa2/0xd0
[ 5489.545604] rcu_sched_clock_irq.cold+0x1ff/0x3d6
[ 5489.545607] ? trigger_load_balance+0x5a/0x240
[ 5489.545610] update_process_times+0x8c/0xc0
[ 5489.545613] tick_sched_handle+0x22/0x60
[ 5489.545615] tick_sched_timer+0x7c/0xb0
[ 5489.545617] ? tick_do_update_jiffies64.part.0+0xc0/0xc0
[ 5489.545619] __hrtimer_run_queues+0x12a/0x270
[ 5489.545621] hrtimer_interrupt+0x110/0x2c0
[ 5489.545623] __sysvec_apic_timer_interrupt+0x5f/0xd0
[ 5489.545627] asm_call_irq_on_stack+0xf/0x20
[ 5489.545627]
[ 5489.545629] sysvec_apic_timer_interrupt+0x72/0x80
[ 5489.545631] asm_sysvec_apic_timer_interrupt+0x12/0x20
OS: Debian 11
Kernel: 5.10.91
ofed: MLNX_OFED_LINUX-5.5-1.0.3.2
PS. Thread worker/u196:2+ib-comp-unb-w is consuming 100% CPU.
After some investigations, We found that the kernel thread spends too
much time in the function wait_for_response(). We know that in the
process of establishing an RDMA connection, when the local side sends
out a MAD request, it will wait for the response packet from the remote
side within a limited time, and it is in the wait_list at this time.
This wait_list is sorted according to the timeout time, which means the
complexity of insertion or search is O(n). When the remote application
does not call rdma_accept() for some reason, it will not send the
response packet. In this way, the entry will not be cleared from the
wait_list on the local side for a long time. And the local application
continues to initiate connections outward, which makes the elements in
the wait_list more and more, and the wait_for_response() longer and longer.
Regard,
Yacan