RE: rdma-for-next, rdma_rxe: inconsistent lock state

"Pearson, Robert B" <robert.pearson2@xxxxxxx> · Tue, 31 May 2022 20:55:20 +0000

-----Original Message-----
From: Bart Van Assche <bvanassche@xxxxxxx> 
Sent: Tuesday, May 31, 2022 3:47 PM
To: Bob Pearson <rpearsonhpe@xxxxxxxxx>
Cc: linux-rdma@xxxxxxxxxxxxxxx
Subject: rdma-for-next, rdma_rxe: inconsistent lock state

Hi Bob,

With the rdma-for-next branch (commit 9c477178a0a1 ("RDMA/rtrs-clt: Fix one kernel-doc comment")) I see the following:

================================
WARNING: inconsistent lock state
5.18.0-dbg #4 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/2/25 [HC0[0]:SC1[1]:HE0:SE0] takes:
ffff888116f0d350 (&xa->xa_lock#12){+.?.}-{2:2}, at: rxe_pool_get_index+0x73/0x170 [rdma_rxe]
{SOFTIRQ-ON-W} state was registered at:
   __lock_acquire+0x45b/0xce0
   lock_acquire+0x18a/0x450
   _raw_spin_lock+0x34/0x50
   __rxe_add_to_pool+0xcc/0x140 [rdma_rxe]
   rxe_alloc_pd+0x2d/0x40 [rdma_rxe]
   __ib_alloc_pd+0xa3/0x270 [ib_core]
   ib_mad_port_open+0x44a/0x790 [ib_core]
   ib_mad_init_device+0x8e/0x110 [ib_core]
   add_client_context+0x26a/0x330 [ib_core]
   enable_device_and_get+0x169/0x2b0 [ib_core]
   ib_register_device+0x26f/0x330 [ib_core]
   rxe_register_device+0x1b4/0x1d0 [rdma_rxe]
   rxe_add+0x8c/0xc0 [rdma_rxe]
   rxe_net_add+0x5b/0x90 [rdma_rxe]
   rxe_newlink+0x71/0x80 [rdma_rxe]
   nldev_newlink+0x21e/0x370 [ib_core]
   rdma_nl_rcv_msg+0x200/0x410 [ib_core]
   rdma_nl_rcv+0x140/0x220 [ib_core]
   netlink_unicast+0x307/0x460
   netlink_sendmsg+0x422/0x750
   __sys_sendto+0x1c2/0x250
   __x64_sys_sendto+0x7f/0x90
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
irq event stamp: 71543
hardirqs last  enabled at (71542): [<ffffffff810cdc28>] __local_bh_enable_ip+0x88/0xf0
hardirqs last disabled at (71543): [<ffffffff81e9d67d>] _raw_spin_lock_irqsave+0x5d/0x60
softirqs last  enabled at (71532): [<ffffffff82200467>] __do_softirq+0x467/0x6e1
softirqs last disabled at (71537): [<ffffffff810cda47>] run_ksoftirqd+0x37/0x60

other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(&xa->xa_lock#12);
   <Interrupt>
     lock(&xa->xa_lock#12);

  *** DEADLOCK ***
no locks held by ksoftirqd/2/25.

stack backtrace:
CPU: 2 PID: 25 Comm: ksoftirqd/2 Not tainted 5.18.0-dbg #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
Call Trace:
  <TASK>
  show_stack+0x52/0x58
  dump_stack_lvl+0x5b/0x82
  dump_stack+0x10/0x12
  print_usage_bug.part.0+0x29c/0x2ab
  mark_lock_irq.cold+0x54/0xbf
  mark_lock.part.0+0x3f5/0xa70
  mark_usage+0x74/0x1a0
  __lock_acquire+0x45b/0xce0
  lock_acquire+0x18a/0x450
  _raw_spin_lock_irqsave+0x43/0x60
  rxe_pool_get_index+0x73/0x170 [rdma_rxe]
  rxe_get_av+0xcc/0x140 [rdma_rxe]
  rxe_requester+0x34c/0xe60 [rdma_rxe]
  rxe_do_task+0xcc/0x140 [rdma_rxe]
  tasklet_action_common.constprop.0+0x168/0x1b0
  tasklet_action+0x42/0x60
  __do_softirq+0x1d8/0x6e1
  run_ksoftirqd+0x37/0x60
  smpboot_thread_fn+0x302/0x410
  kthread+0x183/0x1c0
  ret_from_fork+0x1f/0x30
  </TASK>

Is this perhaps the same issue as what I reported on May 6 (https://lore.kernel.org/all/cf8b9980-3965-a4f6-07e0-d4b25755b0db@xxxxxxx/)?

Thanks,

Bart.

(from windows)

Yes. There is a lock-level bug in rxe_pool.c that needs a patch. I have one, but it is only a temporary fix.
Zhu posted one a while ago, but it was never accepted, and I don't want to step on his toes.
This is related to the "AH bug", i.e. rdmacm holding locks while calling into the verbs APIs, which is just plain evil.
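
For illustration only, here is a minimal userspace model of the check that lockdep is reporting above. It is not the rxe patch and not lockdep's real implementation. The names lock_class, record_acquire and the usage bits are hypothetical. In the trace, xa_lock is taken with a plain spin_lock in __rxe_add_to_pool (process context, softirqs enabled) and then again from the rxe_requester tasklet (softirq context). The model flags that combination. It also shows that the report goes away if the process-context path takes the lock with softirqs disabled, for example via a _bh or _irqsave lock variant (an assumption about the general shape of a fix, not a statement of what the eventual patch does).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Usage bits, loosely named after the states lockdep prints (hypothetical model). */
enum lock_usage {
	USED_IN_SOFTIRQ = 1 << 0,	/* IN-SOFTIRQ-W: acquired in softirq context */
	HELD_SOFTIRQ_ON = 1 << 1,	/* SOFTIRQ-ON-W: acquired with softirqs enabled */
};

struct lock_class {
	const char *name;
	unsigned int usage;	/* accumulated lock_usage bits */
};

/*
 * Record one acquisition of the lock class.
 * Returns true while the accumulated usage is consistent, and false once
 * the lock has been seen both in softirq context and with softirqs enabled,
 * because a softirq could then interrupt a holder on the same CPU and spin
 * on the lock forever.
 */
static bool record_acquire(struct lock_class *lc, bool in_softirq,
			   bool softirqs_enabled)
{
	const unsigned int bad = USED_IN_SOFTIRQ | HELD_SOFTIRQ_ON;

	assert(lc != NULL);

	if (in_softirq)
		lc->usage |= USED_IN_SOFTIRQ;
	else if (softirqs_enabled)
		lc->usage |= HELD_SOFTIRQ_ON;

	return (lc->usage & bad) != bad;
}
```

Calling record_acquire(&lc, false, true) and then record_acquire(&lc, true, false) on the same lock_class reproduces the sequence in the report: the second call returns false. If the first call is made with softirqs_enabled set to false, both calls return true.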

I'll send you my patch.

Bob