-----Original Message-----
From: Bart Van Assche <bvanassche@xxxxxxx>
Sent: Tuesday, May 31, 2022 3:47 PM
To: Bob Pearson <rpearsonhpe@xxxxxxxxx>
Cc: linux-rdma@xxxxxxxxxxxxxxx
Subject: rdma-for-next, rdma_rxe: inconsistent lock state

Hi Bob,

With the rdma-for-next branch (commit 9c477178a0a1 ("RDMA/rtrs-clt: Fix one kernel-doc comment")) I see the following:

================================
WARNING: inconsistent lock state
5.18.0-dbg #4 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/2/25 [HC0[0]:SC1[1]:HE0:SE0] takes:
ffff888116f0d350 (&xa->xa_lock#12){+.?.}-{2:2}, at: rxe_pool_get_index+0x73/0x170 [rdma_rxe]
{SOFTIRQ-ON-W} state was registered at:
  __lock_acquire+0x45b/0xce0
  lock_acquire+0x18a/0x450
  _raw_spin_lock+0x34/0x50
  __rxe_add_to_pool+0xcc/0x140 [rdma_rxe]
  rxe_alloc_pd+0x2d/0x40 [rdma_rxe]
  __ib_alloc_pd+0xa3/0x270 [ib_core]
  ib_mad_port_open+0x44a/0x790 [ib_core]
  ib_mad_init_device+0x8e/0x110 [ib_core]
  add_client_context+0x26a/0x330 [ib_core]
  enable_device_and_get+0x169/0x2b0 [ib_core]
  ib_register_device+0x26f/0x330 [ib_core]
  rxe_register_device+0x1b4/0x1d0 [rdma_rxe]
  rxe_add+0x8c/0xc0 [rdma_rxe]
  rxe_net_add+0x5b/0x90 [rdma_rxe]
  rxe_newlink+0x71/0x80 [rdma_rxe]
  nldev_newlink+0x21e/0x370 [ib_core]
  rdma_nl_rcv_msg+0x200/0x410 [ib_core]
  rdma_nl_rcv+0x140/0x220 [ib_core]
  netlink_unicast+0x307/0x460
  netlink_sendmsg+0x422/0x750
  __sys_sendto+0x1c2/0x250
  __x64_sys_sendto+0x7f/0x90
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
irq event stamp: 71543
hardirqs last enabled at (71542): [<ffffffff810cdc28>] __local_bh_enable_ip+0x88/0xf0
hardirqs last disabled at (71543): [<ffffffff81e9d67d>] _raw_spin_lock_irqsave+0x5d/0x60
softirqs last enabled at (71532): [<ffffffff82200467>] __do_softirq+0x467/0x6e1
softirqs last disabled at (71537): [<ffffffff810cda47>] run_ksoftirqd+0x37/0x60

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&xa->xa_lock#12);
  <Interrupt>
    lock(&xa->xa_lock#12);

 *** DEADLOCK ***

no locks held by ksoftirqd/2/25.

stack backtrace:
CPU: 2 PID: 25 Comm: ksoftirqd/2 Not tainted 5.18.0-dbg #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
Call Trace:
 <TASK>
 show_stack+0x52/0x58
 dump_stack_lvl+0x5b/0x82
 dump_stack+0x10/0x12
 print_usage_bug.part.0+0x29c/0x2ab
 mark_lock_irq.cold+0x54/0xbf
 mark_lock.part.0+0x3f5/0xa70
 mark_usage+0x74/0x1a0
 __lock_acquire+0x45b/0xce0
 lock_acquire+0x18a/0x450
 _raw_spin_lock_irqsave+0x43/0x60
 rxe_pool_get_index+0x73/0x170 [rdma_rxe]
 rxe_get_av+0xcc/0x140 [rdma_rxe]
 rxe_requester+0x34c/0xe60 [rdma_rxe]
 rxe_do_task+0xcc/0x140 [rdma_rxe]
 tasklet_action_common.constprop.0+0x168/0x1b0
 tasklet_action+0x42/0x60
 __do_softirq+0x1d8/0x6e1
 run_ksoftirqd+0x37/0x60
 smpboot_thread_fn+0x302/0x410
 kthread+0x183/0x1c0
 ret_from_fork+0x1f/0x30
 </TASK>

Is this perhaps the same issue as what I reported on May 6 (https://lore.kernel.org/all/cf8b9980-3965-a4f6-07e0-d4b25755b0db@xxxxxxx/)?

Thanks,

Bart.

(from windows)

Yes. There is a lock level bug in rxe_pool.c that requires a patch to fix. I have one that is a temporary fix. Zhu posted one a while ago, but it was never accepted. I don't want to step on his toes. This is related to the "AH bug", i.e. rdmacm holding locks while calling into the verbs APIs, which is just plain evil. I'll send you my patch.

Bob
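
As a reading aid for the lockdep report above: the warning says that the xa_lock class was first taken with softirqs enabled (the _raw_spin_lock call under __rxe_add_to_pool, the {SOFTIRQ-ON-W} state) and was later taken from softirq context (rxe_pool_get_index called from a tasklet, the {IN-SOFTIRQ-W} state). The following is a minimal user-space toy model of that consistency rule, not the kernel's lockdep code. The names lock_class_model, model_acquire, replay_report and check_model are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: this is NOT how lockdep is implemented in the kernel. */
enum ctx {
    CTX_PROCESS_SOFTIRQ_ON,   /* process context, softirqs enabled  */
    CTX_PROCESS_SOFTIRQ_OFF,  /* process context, softirqs disabled */
    CTX_IN_SOFTIRQ            /* running inside a softirq (e.g. tasklet) */
};

/* Usage bits remembered per lock class (hypothetical structure). */
struct lock_class_model {
    bool softirq_on_w;  /* ever acquired with softirqs enabled: {SOFTIRQ-ON-W} */
    bool in_softirq_w;  /* ever acquired inside a softirq: {IN-SOFTIRQ-W} */
};

/* Record one acquisition; return false once the usage is inconsistent. */
static bool model_acquire(struct lock_class_model *c, enum ctx where)
{
    if (where == CTX_PROCESS_SOFTIRQ_ON)
        c->softirq_on_w = true;
    else if (where == CTX_IN_SOFTIRQ)
        c->in_softirq_w = true;
    /* A softirq could interrupt a holder that left softirqs enabled. */
    return !(c->softirq_on_w && c->in_softirq_w);
}

/* Replay the two acquisitions named in the report. */
static bool replay_report(bool bh_disabled_in_process_ctx)
{
    struct lock_class_model xa_lock = { false, false };
    enum ctx add_path = bh_disabled_in_process_ctx
                            ? CTX_PROCESS_SOFTIRQ_OFF
                            : CTX_PROCESS_SOFTIRQ_ON;

    bool ok = model_acquire(&xa_lock, add_path);          /* __rxe_add_to_pool */
    ok = model_acquire(&xa_lock, CTX_IN_SOFTIRQ) && ok;   /* rxe_pool_get_index */
    return ok;
}

/* Sanity checks for the model itself. */
static void check_model(void)
{
    assert(!replay_report(false)); /* as reported: inconsistent */
    assert(replay_report(true));   /* softirqs off in process context: consistent */
}
```

In this model the usage stays consistent once the process-context path runs with softirqs disabled. That is one conventional way to resolve this class of report; the thread does not say what either patch actually changes.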