On 2022/5/7 9:55, Yanjun Zhu wrote:
On 2022/5/7 9:29, Jason Gunthorpe wrote:
On Sat, May 07, 2022 at 08:29:31AM +0800, Yanjun Zhu wrote:
If I try to run the SRP test 002 with the soft-RoCE driver, the
following appears:
[ 749.901966] ================================
[ 749.903638] WARNING: inconsistent lock state
[ 749.905376] 5.18.0-rc5-dbg+ #1 Not tainted
[ 749.907039] --------------------------------
[ 749.908699] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 749.910646] ksoftirqd/5/40 [HC0[0]:SC1[1]:HE0:SE0] takes:
[ 749.912499] ffff88818244d350 (&xa->xa_lock#14){+.?.}-{2:2}, at: rxe_pool_get_index+0x73/0x170 [rdma_rxe]
[ 749.914691] {SOFTIRQ-ON-W} state was registered at:
[ 749.916648] __lock_acquire+0x45b/0xce0
[ 749.918599] lock_acquire+0x18a/0x450
[ 749.920480] _raw_spin_lock+0x34/0x50
[ 749.922580] __rxe_add_to_pool+0xcc/0x140 [rdma_rxe]
[ 749.924583] rxe_alloc_pd+0x2d/0x40 [rdma_rxe]
[ 749.926394] __ib_alloc_pd+0xa3/0x270 [ib_core]
[ 749.928579] ib_mad_port_open+0x44a/0x790 [ib_core]
[ 749.930640] ib_mad_init_device+0x8e/0x110 [ib_core]
[ 749.932495] add_client_context+0x26a/0x330 [ib_core]
[ 749.934302] enable_device_and_get+0x169/0x2b0 [ib_core]
[ 749.936217] ib_register_device+0x26f/0x330 [ib_core]
[ 749.938020] rxe_register_device+0x1b4/0x1d0 [rdma_rxe]
[ 749.939794] rxe_add+0x8c/0xc0 [rdma_rxe]
[ 749.941552] rxe_net_add+0x5b/0x90 [rdma_rxe]
[ 749.943356] rxe_newlink+0x71/0x80 [rdma_rxe]
[ 749.945182] nldev_newlink+0x21e/0x370 [ib_core]
[ 749.946917] rdma_nl_rcv_msg+0x200/0x410 [ib_core]
[ 749.948657] rdma_nl_rcv+0x140/0x220 [ib_core]
[ 749.950373] netlink_unicast+0x307/0x460
[ 749.952063] netlink_sendmsg+0x422/0x750
[ 749.953672] __sys_sendto+0x1c2/0x250
[ 749.955281] __x64_sys_sendto+0x7f/0x90
[ 749.956849] do_syscall_64+0x35/0x80
[ 749.958353] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 749.959942] irq event stamp: 1411849
[ 749.961517] hardirqs last enabled at (1411848): [<ffffffff810cdb28>] __local_bh_enable_ip+0x88/0xf0
[ 749.963338] hardirqs last disabled at (1411849): [<ffffffff81ebf24d>] _raw_spin_lock_irqsave+0x5d/0x60
[ 749.965214] softirqs last enabled at (1411838): [<ffffffff82200467>] __do_softirq+0x467/0x6e1
[ 749.967027] softirqs last disabled at (1411843): [<ffffffff810cd947>] run_ksoftirqd+0x37/0x60
For this, please use this patch series:
news://nntp.lore.kernel.org:119/20220422194416.983549-1-yanjun.zhu@xxxxxxxxx
No, that is the wrong fix for this. This is mismatched lock modes with
the lookup path in the BH; the fix is to consistently use BH locking
with the xarray everywhere or to use RCU. I'm expecting to go with
Bob's RCU patch.
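
To make the "mismatched lock modes" point concrete, here is a minimal user-space sketch of the kind of consistency check behind the warning above. It is not the kernel's lockdep code: the struct and function names (lock_class_model, ctx_model, model_acquire, and the two scenario helpers) are hypothetical, and the model tracks only the two softirq usage bits named in the splat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the usage bits kept per lock class. */
struct lock_class_model {
	bool softirq_on_w;  /* taken in process context with softirqs enabled */
	bool in_softirq_w;  /* taken while running in softirq (BH) context */
};

/* Hypothetical description of the context at the moment the lock is taken. */
struct ctx_model {
	bool in_softirq;        /* true inside a softirq handler, e.g. ksoftirqd */
	bool softirqs_enabled;  /* false when a _bh or _irq lock variant is used */
};

/* Record one acquisition; return false if the combined usage is inconsistent. */
bool model_acquire(struct lock_class_model *lc, struct ctx_model ctx)
{
	assert(lc != NULL);
	if (ctx.in_softirq)
		lc->in_softirq_w = true;
	else if (ctx.softirqs_enabled)
		lc->softirq_on_w = true;
	/*
	 * If both bits are set, a softirq could interrupt the holder that left
	 * softirqs enabled on the same CPU and then spin on the same lock.
	 */
	return !(lc->softirq_on_w && lc->in_softirq_w);
}

/* Models the reported case: plain spin_lock() first, then a softirq lookup. */
bool model_mixed_modes(void)
{
	struct lock_class_model xa_lock = { false, false };
	struct ctx_model process_plain = { .in_softirq = false, .softirqs_enabled = true };
	struct ctx_model softirq = { .in_softirq = true, .softirqs_enabled = false };

	(void)model_acquire(&xa_lock, process_plain);
	return model_acquire(&xa_lock, softirq);
}

/* Models BH locking everywhere: process context takes the lock with BH off. */
bool model_bh_everywhere(void)
{
	struct lock_class_model xa_lock = { false, false };
	struct ctx_model process_bh = { .in_softirq = false, .softirqs_enabled = false };
	struct ctx_model softirq = { .in_softirq = true, .softirqs_enabled = false };

	(void)model_acquire(&xa_lock, process_bh);
	return model_acquire(&xa_lock, softirq);
}
```

In this model, model_mixed_modes() returns false, which corresponds to the {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} report, while model_bh_everywhere() returns true. The RCU alternative is not modelled here.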
Bob's RCU patch causes some problems in atomic context. I am not sure
these problems can be fixed properly.
I looked into Bob's RCU patch series. In this patch:
https://patchwork.kernel.org/project/linux-rdma/patch/20220421014042.26985-9-rpearsonhpe@xxxxxxxxx/
__rxe_cleanup is sometimes called between spin_lock_irq and spin_unlock_irq.
With Bob's RCU patch, this will cause an oops.
Best Regards,
Zhu Yanjun
Zhu Yanjun
We still need a proper patch for the AH problem.
Jason