On Sun, Nov 14, 2021 at 7:12 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Thu, Nov 11, 2021 at 05:25:11PM +0100, Haris Iqbal wrote:
> > Hi,
> >
> > We are experiencing a hang while using RTRS with softROCE and
> > transitioning the network interface down using ifdown command.
> >
> > Steps.
> > 1) Map an RNBD/RTRS device through softROCE port.
> > 2) Once mapped, transition the eth interface to down (on which the
> > softROCE interface was created) using command "ifconfig <ethx> down",
> > or "ip link set <ethx> down".
> > 3) The device errors out, and one can see RTRS connection errors in
> > dmesg. So far so good.
> > 4) After a while, we see task hung traces in dmesg.
> >
> > [ 550.866462] INFO: task kworker/1:2:170 blocked for more than 184 seconds.
> > [ 550.868820] Tainted: G O 5.10.42-pserver+ #84
> > [ 550.869337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 550.869963] task:kworker/1:2 state:D stack: 0 pid: 170
> > ppid: 2 flags:0x00004000
> > [ 550.870619] Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
> > [ 550.871134] Call Trace:
> > [ 550.871375] __schedule+0x421/0x810
> > [ 550.871683] schedule+0x46/0xb0
> > [ 550.871964] schedule_timeout+0x20e/0x2a0
> > [ 550.872300] ? internal_add_timer+0x44/0x70
> > [ 550.872650] wait_for_completion+0x86/0xe0
> > [ 550.872994] cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [ 550.873357] ? _cond_resched+0x15/0x30
> > [ 550.873680] ? wait_for_completion+0x33/0xe0
> > [ 550.874036] _destroy_id+0x57/0x210 [rdma_cm]
> > [ 550.874395] rtrs_srv_close_work+0xcc/0x250 [rtrs_server]
> > [ 550.874819] process_one_work+0x1d4/0x370
> > [ 550.875156] worker_thread+0x4a/0x3b0
> > [ 550.875471] ? process_one_work+0x370/0x370
> > [ 550.875817] kthread+0xfe/0x140
> > [ 550.876098] ? kthread_park+0x90/0x90
> > [ 550.876453] ret_from_fork+0x1f/0x30
> >
> >
> > Our observations till now.
> >
> > 1) Does not occur if we use "ifdown <ethx>" instead. There is a
> > difference between the commands, but we are not sure if the above one
> > should lead to a task hang.
> > https://access.redhat.com/solutions/27166
> > 2) We have verified v5.10 and v.15.1 kernels, and both have this issue.
> > 3) We tried the same test with NvmeOf target and host over softROCE.
> > We get the same task hang after doing "ifconfig .. down".
> >
> > [Tue Nov 9 14:28:51 2021] INFO: task kworker/1:1:34 blocked for more
> > than 184 seconds.
> > [Tue Nov 9 14:28:51 2021] Tainted: G O
> > 5.10.42-pserver+ #84
> > [Tue Nov 9 14:28:51 2021] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Tue Nov 9 14:28:51 2021] task:kworker/1:1 state:D stack: 0
> > pid: 34 ppid: 2 flags:0x00004000
> > [Tue Nov 9 14:28:51 2021] Workqueue: events
> > nvmet_rdma_release_queue_work [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] Call Trace:
> > [Tue Nov 9 14:28:51 2021] __schedule+0x421/0x810
> > [Tue Nov 9 14:28:51 2021] schedule+0x46/0xb0
> > [Tue Nov 9 14:28:51 2021] schedule_timeout+0x20e/0x2a0
> > [Tue Nov 9 14:28:51 2021] ? internal_add_timer+0x44/0x70
> > [Tue Nov 9 14:28:51 2021] wait_for_completion+0x86/0xe0
> > [Tue Nov 9 14:28:51 2021] cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [Tue Nov 9 14:28:51 2021] ? _cond_resched+0x15/0x30
> > [Tue Nov 9 14:28:51 2021] ? wait_for_completion+0x33/0xe0
> > [Tue Nov 9 14:28:51 2021] _destroy_id+0x57/0x210 [rdma_cm]
> > [Tue Nov 9 14:28:51 2021] nvmet_rdma_free_queue+0x2e/0xc0 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] nvmet_rdma_release_queue_work+0x19/0x50 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] process_one_work+0x1d4/0x370
> > [Tue Nov 9 14:28:51 2021] worker_thread+0x4a/0x3b0
> > [Tue Nov 9 14:28:51 2021] ? process_one_work+0x370/0x370
> > [Tue Nov 9 14:28:51 2021] kthread+0xfe/0x140
> > [Tue Nov 9 14:28:51 2021] ? kthread_park+0x90/0x90
> > [Tue Nov 9 14:28:51 2021] ret_from_fork+0x1f/0x30
> >
> > Is this a known issue with ifconfig or the rxe driver? Thoughts?
>
> It doesn't look related to RXE and if to judge by the call trace the
> issue is one of two: missing call to cm_deref_id() or extra call
> to refcount_inc(&cm_id_priv->refcount). Both can be related to the
> callers of rdma-cm interface. For example by not releasing cm_id
> properly.

Hi,

@Aleksei did some more investigation and found the cause of the hang.

1. We enabled traces for rdma_cma and ib_cma events.

2. When we call ifdown (the case that does not hang), the following
trace is observed:

kworker/1:2-100 [001] .... 98.191532: cm_disconnect: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.208935: icm_send_drep_err: local_id=3791663578 remote_id=2340456478 state=TIMEWAIT lap_state=LAP_UNINIT
kworker/1:2-100 [001] .N.. 98.213924: cm_disconnect: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.218595: icm_send_drep_err: local_id=3808440794 remote_id=2323679262 state=TIMEWAIT lap_state=LAP_UNINIT
kworker/1:2-100 [001] .N.. 98.219583: cm_disconnect: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.228200: icm_send_drep_err: local_id=3758109146 remote_id=2306902046 state=TIMEWAIT lap_state=LAP_UNINIT
kworker/1:2-100 [001] .... 98.322883: cm_qp_destroy: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=17
kworker/1:2-100 [001] .... 98.322913: cm_id_destroy: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] .... 98.332065: cm_qp_destroy: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=18
kworker/1:2-100 [001] .... 98.332115: cm_id_destroy: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] .... 98.340362: cm_qp_destroy: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=19
kworker/1:2-100 [001] .... 98.340400: cm_id_destroy: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0

Explanation: In this example we have 3 cm.ids (2, 3, 4). For each of
them we call rdma_disconnect(), which calls ib_send_cm_dreq(), which
calls cm_send_dreq_locked(), where we hit

ret = cm_alloc_msg(cm_id_priv, &msg);
if (ret) {
        cm_enter_timewait(cm_id_priv);
        return ret;
}

cm_alloc_msg() returns an error (the error comes from rdma_create_ah())
before reaching the refcount_inc(&cm_id_priv->refcount) inside
cm_alloc_msg(), so the refcount of each cm.id is still 1 after
rdma_disconnect(). Hence when cm_destroy_id() later calls cm_deref_id()
one more time, the refcount drops to 0 and we do NOT hang on
wait_for_completion(&cm_id_priv->comp).

3. In the case of ifconfig down (the one that hangs), the trace is the
following:

kworker/1:0-19 [001] .... 289.211230: cm_disconnect: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] d... 289.222918: icm_send_dreq: local_id=3636223036 remote_id=2994969923 state=ESTABLISHED lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.224789: cm_sent_dreq: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/u5:0-63 [001] d... 289.224820: icm_mad_send_err: state=DREQ_SENT completion status=LOC_PROT_ERR
kworker/u5:0-63 [001] .... 289.224821: cm_event_handler: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED (10/-110)
kworker/u5:0-63 [001] .... 289.224822: cm_event_done: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED consumer returns 0
kworker/1:0-19 [001] .N.. 289.226797: cm_disconnect: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] dN.. 289.238556: icm_send_dreq: local_id=3619445820 remote_id=3045301571 state=ESTABLISHED lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.240126: cm_sent_dreq: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .N.. 289.241162: cm_disconnect: cm.id=7 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] dN.. 289.251148: icm_send_dreq: local_id=3602668604 remote_id=3028524355 state=ESTABLISHED lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.252624: cm_sent_dreq: cm.id=7 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .... 289.340055: cm_qp_destroy: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=19
kworker/1:0-19 [001] .... 289.340159: cm_id_destroy: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .... 289.367393: cm_qp_destroy: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=20
kworker/1:0-19 [001] .... 289.367445: cm_id_destroy: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0

Explanation: In this example we have 3 cm.ids (5, 6, 7). For each of
them we call rdma_disconnect(), which calls ib_send_cm_dreq(), which
calls cm_send_dreq_locked(), which calls cm_alloc_msg(). In this case
cm_alloc_msg() does not return an error, it increases the refcount of
each cm.id to 2, and the DREQ is sent. But later the send handler is
called only for cm.id=5, with LOC_PROT_ERR:

icm_mad_send_err: state=DREQ_SENT completion status=LOC_PROT_ERR

which triggers cm_process_send_error -> cm_free_msg -> cm_deref_id and
decreases the refcount of cm.id=5 to 1. For the other cm.ids (6, 7) the
refcount stays at 2, since their DREQs were sent but neither completed
nor timed out; nothing further happens with those DREQs.

So later, when cm_destroy_id() is called for cm.id=5 (refcount=1), the
final cm_deref_id() brings the refcount to 0 and we do not hang on
wait_for_completion(&cm_id_priv->comp). But for cm.id=6,7 cm_deref_id()
only drops the refcount from 2 to 1, and we wait in
wait_for_completion() forever, because those sent DREQs are still
outstanding somewhere.
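To make the refcount accounting easier to follow, here is a small
stand-alone model of the bookkeeping as we understand it. This is not
kernel code: struct cm_id_model and the helper names are made up for
illustration; they only mirror the inc/dec pattern around
cm_alloc_msg(), cm_free_msg()/cm_process_send_error() and
cm_destroy_id() described above.

/*
 * Toy model of the cm_id refcount bookkeeping -- NOT kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct cm_id_model {
        int id;
        int refcount;           /* a cm_id starts with refcount = 1 */
};

/* cm_send_dreq_locked(): on success cm_alloc_msg() takes an extra
 * reference for the in-flight DREQ MAD; on failure it does not. */
static void send_dreq(struct cm_id_model *c, bool cm_alloc_msg_ok)
{
        if (cm_alloc_msg_ok)
                c->refcount++;  /* refcount_inc() inside cm_alloc_msg() */
}

/* Send completion or send error: cm_process_send_error()/cm_free_msg()
 * drop the reference that was taken for the DREQ. */
static void dreq_send_done(struct cm_id_model *c)
{
        c->refcount--;          /* cm_deref_id() */
}

/* cm_destroy_id(): drop the last caller reference, then wait until the
 * refcount reaches zero, i.e. wait_for_completion(&cm_id_priv->comp). */
static void destroy_id(struct cm_id_model *c)
{
        c->refcount--;          /* final cm_deref_id() */
        printf("cm.id=%d: refcount=%d -> %s\n", c->id, c->refcount,
               c->refcount == 0 ? "completion fires, no hang"
                                : "wait_for_completion() blocks forever");
}

int main(void)
{
        struct cm_id_model id2 = { .id = 2, .refcount = 1 }; /* ifdown        */
        struct cm_id_model id5 = { .id = 5, .refcount = 1 }; /* ifconfig down */
        struct cm_id_model id6 = { .id = 6, .refcount = 1 }; /* ifconfig down */

        send_dreq(&id2, false); /* ifdown: cm_alloc_msg() fails, no extra ref */
        send_dreq(&id5, true);  /* ifconfig down: DREQ queued, refcount 2     */
        send_dreq(&id6, true);  /* ifconfig down: DREQ queued, refcount 2     */

        dreq_send_done(&id5);   /* cm.id=5 gets the LOC_PROT_ERR completion   */
        /* cm.id=6 (and 7) never get a send completion or timeout */

        destroy_id(&id2);       /* 1 -> 0: no hang */
        destroy_id(&id5);       /* 1 -> 0: no hang */
        destroy_id(&id6);       /* 2 -> 1: hangs   */
        return 0;
}

Running it prints "blocks forever" only for cm.id=6, which is exactly
the situation we see for cm.id=6 and 7 above.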
4. The reason why cm_alloc_msg() in one case (ifdown "interface")
returns an error without increasing the refcount, while in the other
case (ifconfig down "interface") it increases the refcount and returns
0, is what happens inside cm_alloc_msg(), namely the gid entry state in
the ib_gid_table.

Explanation: cm_alloc_msg() goes through the following call chain:

cm_alloc_msg -> rdma_create_ah -> rdma_lag_get_ah_roce_slave ->
rdma_read_gid_attr_ndev_rcu -> is_gid_entry_valid

and is_gid_entry_valid() returns

entry && entry->state == GID_TABLE_ENTRY_VALID;

In the ifdown case the entry is either NULL or in the state
GID_TABLE_ENTRY_PENDING_DEL, so the function returns false and
cm_alloc_msg() fails without increasing the refcount. In the ifconfig
down case the entry is still in the state GID_TABLE_ENTRY_VALID, so the
function returns true and cm_alloc_msg() increases the refcount of the
cm.id.
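For completeness, here is a minimal stand-alone paraphrase of that
check, so the two cases are easy to compare. It is not the exact
ib_core code (the real GID cache definitions have many more fields and
the enum values may differ); only the state check itself is taken from
the chain above.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stripped-down paraphrase of ib_core's GID table entry. */
enum gid_table_entry_state {
        GID_TABLE_ENTRY_INVALID,
        GID_TABLE_ENTRY_VALID,
        GID_TABLE_ENTRY_PENDING_DEL,
};

struct ib_gid_table_entry {
        enum gid_table_entry_state state;
};

/* The check at the bottom of the
 * cm_alloc_msg -> rdma_create_ah -> ... -> rdma_read_gid_attr_ndev_rcu
 * call chain. */
static bool is_gid_entry_valid(const struct ib_gid_table_entry *entry)
{
        return entry && entry->state == GID_TABLE_ENTRY_VALID;
}

int main(void)
{
        struct ib_gid_table_entry pending = { .state = GID_TABLE_ENTRY_PENDING_DEL };
        struct ib_gid_table_entry valid = { .state = GID_TABLE_ENTRY_VALID };

        /* ifdown: entry is gone or marked for deletion -> cm_alloc_msg()
         * fails and no extra cm_id reference is taken. */
        printf("ifdown:        %d %d\n",
               is_gid_entry_valid(NULL), is_gid_entry_valid(&pending));

        /* ifconfig down: entry is still VALID -> cm_alloc_msg() succeeds,
         * takes a reference and queues a DREQ on an interface that is
         * going down. */
        printf("ifconfig down: %d\n", is_gid_entry_valid(&valid));
        return 0;
}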
We think one of the questions below points to the problem:

1. Why was cm_process_send_error triggered for only 1 of the 3 sent
DREQs? Why is cm_send_handler not called for the others?

2. When RTRS receives an error and starts closing the connection,
during which rdma_disconnect() is called, shouldn't cm_alloc_msg() fail
in such a case?

The same behavior is observed for RTRS and for NVMeoF targets.

>
> Thanks
>
> >
> > Regards
> > -Haris