Hi Jason & Leon,
[Sorry for sending again; the previous mail didn't hit the list.]
With the mentioned commit, I got the call traces below.
[ 1209.544359] INFO: task kworker/u8:3:615 blocked for more than 120 seconds.
[ 1209.545170] Tainted: G O 5.9.0-rc8+ #30
[ 1209.545341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1209.545814] task:kworker/u8:3 state:D stack: 0 pid: 615 ppid: 2 flags:0x00004000
[ 1209.547601] Workqueue: rdma_cm cma_work_handler [rdma_cm]
[ 1209.548645] Call Trace:
[ 1209.552120] __schedule+0x363/0x7b0
[ 1209.552903] schedule+0x44/0xb0
[ 1209.553093] schedule_preempt_disabled+0xe/0x10
[ 1209.553253] __mutex_lock.isra.10+0x26d/0x4e0
[ 1209.553421] __mutex_lock_slowpath+0x13/0x20
[ 1209.553596] ? __mutex_lock_slowpath+0x13/0x20
[ 1209.553751] mutex_lock+0x2f/0x40
[ 1209.554045] rdma_connect+0x52/0x770 [rdma_cm]
[ 1209.554215] ? vprintk_default+0x1f/0x30
[ 1209.554359] ? vprintk_func+0x62/0x100
[ 1209.554532] ? cma_cm_event_handler+0x2b/0xe0 [rdma_cm]
[ 1209.554688] ? printk+0x52/0x6e
[ 1209.554852] rtrs_clt_rdma_cm_handler+0x393/0x8b0 [rtrs_client]
[ 1209.555056] cma_cm_event_handler+0x2b/0xe0 [rdma_cm]
[ 1209.561823] ? cma_comp_exch+0x4e/0x60 [rdma_cm]
[ 1209.561883] ? cma_cm_event_handler+0x2b/0xe0 [rdma_cm]
[ 1209.561940] cma_work_handler+0x89/0xc0 [rdma_cm]
[ 1209.561988] process_one_work+0x20c/0x400
[...]
[ 1209.562323] INFO: task bash:630 blocked for more than 120 seconds.
[ 1209.562368] Tainted: G O 5.9.0-rc8+ #30
[ 1209.562404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1209.562456] task:bash state:D stack: 0 pid: 630 ppid: 618 flags:0x00004000
[ 1209.562515] Call Trace:
[ 1209.562559] __schedule+0x363/0x7b0
[ 1209.562603] schedule+0x44/0xb0
[ 1209.562642] schedule_preempt_disabled+0xe/0x10
[ 1209.562685] __mutex_lock.isra.10+0x26d/0x4e0
[ 1209.562727] ? vprintk_func+0x62/0x100
[ 1209.562765] ? vprintk_default+0x1f/0x30
[ 1209.562806] __mutex_lock_slowpath+0x13/0x20
[ 1209.562847] ? __mutex_lock_slowpath+0x13/0x20
[ 1209.562887] mutex_lock+0x2f/0x40
[ 1209.562936] rdma_destroy_id+0x31/0x60 [rdma_cm]
[ 1209.562983] destroy_cm.isra.24+0x2d/0x50 [rtrs_client]
[ 1209.563586] init_sess+0x23b/0xac0 [rtrs_client]
[ 1209.563691] ? wait_woken+0x80/0x80
[ 1209.563742] rtrs_clt_open+0x27f/0x540 [rtrs_client]
[ 1209.563791] ? remap_devs+0x150/0x150 [rnbd_client]
[ 1209.563867] find_and_get_or_create_sess+0x4a1/0x5e0 [rnbd_client]
[ 1209.563921] ? remap_devs+0x150/0x150 [rnbd_client]
[ 1209.563966] ? atomic_notifier_call_chain+0x1a/0x20
[ 1209.564011] ? vt_console_print+0x203/0x3d0
[ 1209.564051] ? up+0x32/0x50
[ 1209.564093] ? __irq_work_queue_local+0x4f/0x60
[ 1209.564132] ? irq_work_queue+0x1a/0x30
[ 1209.564169] ? wake_up_klogd.part.28+0x34/0x40
[ 1209.564208] ? vprintk_emit+0x126/0x2b0
[ 1209.564249] ? vprintk_default+0x1f/0x30
[ 1209.564292] rnbd_clt_map_device+0x74/0x7b0 [rnbd_client]
[ 1209.564341] rnbd_clt_map_device_store+0x39e/0x6c0 [rnbd_client]
From my understanding, it is caused by rdma_connect taking the
handler_mutex a second time on this path (see the sketch after the
chain):
cma_ib_handler -> mutex_lock(&id_priv->handler_mutex)
  -> cma_cm_event_handler
    -> id_priv->id.event_handler = rtrs_clt_rdma_cm_handler
      -> rdma_connect
        -> mutex_lock(&id_priv->handler_mutex)   <- already held, deadlock
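
To make the locking pattern concrete, below is a minimal userspace
sketch of the same self-deadlock. It is not the real rdma_cm code: the
fake_* names are hypothetical stand-ins, and a pthread mutex plays the
role of the kernel's non-recursive struct mutex.

#include <pthread.h>
#include <stdio.h>

/* Plays the role of id_priv->handler_mutex; non-recursive, like the
 * kernel's struct mutex. */
static pthread_mutex_t handler_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for rdma_connect(): takes the handler mutex again. */
static void fake_rdma_connect(void)
{
	pthread_mutex_lock(&handler_mutex);	/* second lock: never returns */
	printf("connected\n");			/* never reached */
	pthread_mutex_unlock(&handler_mutex);
}

/* Stands in for the ULP event handler (rtrs_clt_rdma_cm_handler in
 * the trace above). */
static void fake_event_handler(void)
{
	fake_rdma_connect();
}

/* Stands in for cma_ib_handler(): invokes the ULP handler with the
 * mutex already held. */
static void fake_cm_handler(void)
{
	pthread_mutex_lock(&handler_mutex);	/* first lock */
	fake_event_handler();
	pthread_mutex_unlock(&handler_mutex);
}

int main(void)
{
	fake_cm_handler();			/* hangs on the second lock */
	return 0;
}

Built with "gcc -pthread", this hangs exactly where the kworker above
does: the second lock can never succeed, because the same context
already holds the mutex.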
And it seems nvme-rdma could have the same issue, since rdma_connect
is also called from the event-handler path:

nvme_rdma_cm_handler
  -> nvme_rdma_route_resolved
    -> rdma_connect
Thanks,
Guoqing