On Sun, Nov 14, 2021 at 7:12 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Thu, Nov 11, 2021 at 05:25:11PM +0100, Haris Iqbal wrote:
> > Hi,
> >
> > We are experiencing a hang while using RTRS with softROCE and
> > transitioning the network interface down using ifdown command.
> >
> > Steps.
> > 1) Map an RNBD/RTRS device through softROCE port.
> > 2) Once mapped, transition the eth interface to down (on which the
> > softROCE interface was created) using command "ifconfig <ethx> down",
> > or "ip link set <ethx> down".
> > 3) The device errors out, and one can see RTRS connection errors in
> > dmesg. So far so good.
> > 4) After a while, we see task hung traces in dmesg.
> >
> > [ 550.866462] INFO: task kworker/1:2:170 blocked for more than 184 seconds.
> > [ 550.868820] Tainted: G O 5.10.42-pserver+ #84
> > [ 550.869337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 550.869963] task:kworker/1:2 state:D stack: 0 pid: 170
> > ppid: 2 flags:0x00004000
> > [ 550.870619] Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
> > [ 550.871134] Call Trace:
> > [ 550.871375] __schedule+0x421/0x810
> > [ 550.871683] schedule+0x46/0xb0
> > [ 550.871964] schedule_timeout+0x20e/0x2a0
> > [ 550.872300] ? internal_add_timer+0x44/0x70
> > [ 550.872650] wait_for_completion+0x86/0xe0
> > [ 550.872994] cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [ 550.873357] ? _cond_resched+0x15/0x30
> > [ 550.873680] ? wait_for_completion+0x33/0xe0
> > [ 550.874036] _destroy_id+0x57/0x210 [rdma_cm]
> > [ 550.874395] rtrs_srv_close_work+0xcc/0x250 [rtrs_server]
> > [ 550.874819] process_one_work+0x1d4/0x370
> > [ 550.875156] worker_thread+0x4a/0x3b0
> > [ 550.875471] ? process_one_work+0x370/0x370
> > [ 550.875817] kthread+0xfe/0x140
> > [ 550.876098] ? kthread_park+0x90/0x90
> > [ 550.876453] ret_from_fork+0x1f/0x30
> >
> >
> > Our observations till now.
> >
> > 1) Does not occur if we use "ifdown <ethx>" instead. There is a
> > difference between the commands, but we are not sure if the above one
> > should lead to a task hang.
> > https://access.redhat.com/solutions/27166
> > 2) We have verified v5.10 and v.15.1 kernels, and both have this issue.
> > 3) We tried the same test with NvmeOf target and host over softROCE.
> > We get the same task hang after doing "ifconfig .. down".
> >
> > [Tue Nov 9 14:28:51 2021] INFO: task kworker/1:1:34 blocked for more
> > than 184 seconds.
> > [Tue Nov 9 14:28:51 2021] Tainted: G O
> > 5.10.42-pserver+ #84
> > [Tue Nov 9 14:28:51 2021] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Tue Nov 9 14:28:51 2021] task:kworker/1:1 state:D stack: 0
> > pid: 34 ppid: 2 flags:0x00004000
> > [Tue Nov 9 14:28:51 2021] Workqueue: events
> > nvmet_rdma_release_queue_work [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] Call Trace:
> > [Tue Nov 9 14:28:51 2021] __schedule+0x421/0x810
> > [Tue Nov 9 14:28:51 2021] schedule+0x46/0xb0
> > [Tue Nov 9 14:28:51 2021] schedule_timeout+0x20e/0x2a0
> > [Tue Nov 9 14:28:51 2021] ? internal_add_timer+0x44/0x70
> > [Tue Nov 9 14:28:51 2021] wait_for_completion+0x86/0xe0
> > [Tue Nov 9 14:28:51 2021] cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [Tue Nov 9 14:28:51 2021] ? _cond_resched+0x15/0x30
> > [Tue Nov 9 14:28:51 2021] ? wait_for_completion+0x33/0xe0
> > [Tue Nov 9 14:28:51 2021] _destroy_id+0x57/0x210 [rdma_cm]
> > [Tue Nov 9 14:28:51 2021] nvmet_rdma_free_queue+0x2e/0xc0 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] nvmet_rdma_release_queue_work+0x19/0x50 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] process_one_work+0x1d4/0x370
> > [Tue Nov 9 14:28:51 2021] worker_thread+0x4a/0x3b0
> > [Tue Nov 9 14:28:51 2021] ? process_one_work+0x370/0x370
> > [Tue Nov 9 14:28:51 2021] kthread+0xfe/0x140
> > [Tue Nov 9 14:28:51 2021] ? kthread_park+0x90/0x90
> > [Tue Nov 9 14:28:51 2021] ret_from_fork+0x1f/0x30
> >
> > Is this a known issue with ifconfig or the rxe driver? Thoughts?
>
> It doesn't look related to RXE and if to judge by the call trace the
> issue is one of two: missing call to cm_deref_id() or extra call
> to refcount_inc(&cm_id_priv->refcount). Both can be related to the
> callers of rdma-cm interface. For example by not releasing cm_id
> properly.

Hi,

@Aleksei did some more investigation and found the cause of the hang.

1. We enabled traces for rdma_cma and ib_cma events.

2. When we call ifdown (the case that does not hang), the following
trace is observed:

kworker/1:2-100 [001] .... 98.191532: cm_disconnect: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.208935: icm_send_drep_err: local_id=3791663578 remote_id=2340456478 state=TIMEWAIT lap_state=LAP_UNINIT
kworker/1:2-100 [001] .N.. 98.213924: cm_disconnect: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.218595: icm_send_drep_err: local_id=3808440794 remote_id=2323679262 state=TIMEWAIT lap_state=LAP_UNINIT
kworker/1:2-100 [001] .N.. 98.219583: cm_disconnect: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.228200: icm_send_drep_err: local_id=3758109146 remote_id=2306902046 state=TIMEWAIT lap_state=LAP_UNINIT
kworker/1:2-100 [001] .... 98.322883: cm_qp_destroy: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=17
kworker/1:2-100 [001] .... 98.322913: cm_id_destroy: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] .... 98.332065: cm_qp_destroy: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=18
kworker/1:2-100 [001] .... 98.332115: cm_id_destroy: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] .... 98.340362: cm_qp_destroy: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=19
kworker/1:2-100 [001] .... 98.340400: cm_id_destroy: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0

Explanation: In this example we have 3 cm.ids (2, 3, 4). For each of
them we call rdma_disconnect(), which calls ib_send_cm_dreq(), which
calls cm_send_dreq_locked(), where we hit

ret = cm_alloc_msg(cm_id_priv, &msg);
if (ret) {
        cm_enter_timewait(cm_id_priv);
        return ret;
}

cm_alloc_msg() returns an error (the error comes from rdma_create_ah())
before reaching the refcount_inc(&cm_id_priv->refcount) inside
cm_alloc_msg(), so the refcount of each cm.id is still 1 after
rdma_disconnect(). Hence when cm_destroy_id() later calls cm_deref_id()
one more time, the refcount drops to 0 and we do NOT hang on
wait_for_completion(&cm_id_priv->comp).

3. In the case of ifconfig down (the one that hangs), the trace is the
following:

kworker/1:0-19 [001] .... 289.211230: cm_disconnect: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] d... 289.222918: icm_send_dreq: local_id=3636223036 remote_id=2994969923 state=ESTABLISHED lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.224789: cm_sent_dreq: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/u5:0-63 [001] d... 289.224820: icm_mad_send_err: state=DREQ_SENT completion status=LOC_PROT_ERR
kworker/u5:0-63 [001] .... 289.224821: cm_event_handler: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED (10/-110)
kworker/u5:0-63 [001] .... 289.224822: cm_event_done: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED consumer returns 0
kworker/1:0-19 [001] .N.. 289.226797: cm_disconnect: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] dN.. 289.238556: icm_send_dreq: local_id=3619445820 remote_id=3045301571 state=ESTABLISHED lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.240126: cm_sent_dreq: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .N.. 289.241162: cm_disconnect: cm.id=7 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] dN.. 289.251148: icm_send_dreq: local_id=3602668604 remote_id=3028524355 state=ESTABLISHED lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.252624: cm_sent_dreq: cm.id=7 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .... 289.340055: cm_qp_destroy: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=19
kworker/1:0-19 [001] .... 289.340159: cm_id_destroy: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .... 289.367393: cm_qp_destroy: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=20
kworker/1:0-19 [001] .... 289.367445: cm_id_destroy: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0

Explanation: In this example we have 3 cm.ids (5, 6, 7). For each of
them we call rdma_disconnect(), which calls ib_send_cm_dreq(), which
calls cm_send_dreq_locked(), which calls cm_alloc_msg(). In this case
cm_alloc_msg() does not return an error, it increases the refcount of
each cm.id to 2, and the DREQ is sent. But later the send handler is
called only for cm.id=5, with LOC_PROT_ERR:

icm_mad_send_err: state=DREQ_SENT completion status=LOC_PROT_ERR

which triggers cm_process_send_error -> cm_free_msg -> cm_deref_id and
decreases the refcount of cm.id=5 to 1. For the other cm.ids (6, 7) the
refcount stays at 2, since their DREQs were sent but neither completed
nor timed out; nothing further happens with those DREQs.

So later, when cm_destroy_id() is called for cm.id=5 (refcount=1), the
final cm_deref_id() brings the refcount to 0 and we do not hang on
wait_for_completion(&cm_id_priv->comp). But for cm.id=6,7 cm_deref_id()
only drops the refcount from 2 to 1, and we wait in
wait_for_completion() forever, because those sent DREQs are still
outstanding somewhere.
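To make the refcount accounting easier to follow, here is a small
stand-alone model of the bookkeeping as we understand it. This is not
kernel code: struct cm_id_model and the helper names are made up for
illustration; they only mirror the inc/dec pattern around
cm_alloc_msg(), cm_free_msg()/cm_process_send_error() and
cm_destroy_id() described above.

/*
 * Toy model of the cm_id refcount bookkeeping -- NOT kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct cm_id_model {
        int id;
        int refcount;           /* a cm_id starts with refcount = 1 */
};

/* cm_send_dreq_locked(): on success cm_alloc_msg() takes an extra
 * reference for the in-flight DREQ MAD; on failure it does not. */
static void send_dreq(struct cm_id_model *c, bool cm_alloc_msg_ok)
{
        if (cm_alloc_msg_ok)
                c->refcount++;  /* refcount_inc() inside cm_alloc_msg() */
}

/* Send completion or send error: cm_process_send_error()/cm_free_msg()
 * drop the reference that was taken for the DREQ. */
static void dreq_send_done(struct cm_id_model *c)
{
        c->refcount--;          /* cm_deref_id() */
}

/* cm_destroy_id(): drop the last caller reference, then wait until the
 * refcount reaches zero, i.e. wait_for_completion(&cm_id_priv->comp). */
static void destroy_id(struct cm_id_model *c)
{
        c->refcount--;          /* final cm_deref_id() */
        printf("cm.id=%d: refcount=%d -> %s\n", c->id, c->refcount,
               c->refcount == 0 ? "completion fires, no hang"
                                : "wait_for_completion() blocks forever");
}

int main(void)
{
        struct cm_id_model id2 = { .id = 2, .refcount = 1 }; /* ifdown        */
        struct cm_id_model id5 = { .id = 5, .refcount = 1 }; /* ifconfig down */
        struct cm_id_model id6 = { .id = 6, .refcount = 1 }; /* ifconfig down */

        send_dreq(&id2, false); /* ifdown: cm_alloc_msg() fails, no extra ref */
        send_dreq(&id5, true);  /* ifconfig down: DREQ queued, refcount 2     */
        send_dreq(&id6, true);  /* ifconfig down: DREQ queued, refcount 2     */

        dreq_send_done(&id5);   /* cm.id=5 gets the LOC_PROT_ERR completion   */
        /* cm.id=6 (and 7) never get a send completion or timeout */

        destroy_id(&id2);       /* 1 -> 0: no hang */
        destroy_id(&id5);       /* 1 -> 0: no hang */
        destroy_id(&id6);       /* 2 -> 1: hangs   */
        return 0;
}

Running it prints "blocks forever" only for cm.id=6, which is exactly
the situation we see for cm.id=6 and 7 above.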
4. The reason why cm_alloc_msg() in one case (ifdown "interface")
returns an error without increasing the refcount, while in the other
case (ifconfig down "interface") it increases the refcount and returns
0, is what happens inside cm_alloc_msg(), namely the gid entry state in
the ib_gid_table.

Explanation: cm_alloc_msg() goes through the following call chain:

cm_alloc_msg -> rdma_create_ah -> rdma_lag_get_ah_roce_slave ->
rdma_read_gid_attr_ndev_rcu -> is_gid_entry_valid

and is_gid_entry_valid() returns

entry && entry->state == GID_TABLE_ENTRY_VALID;

In the ifdown case the entry is either NULL or in the state
GID_TABLE_ENTRY_PENDING_DEL, so the function returns false and
cm_alloc_msg() fails without increasing the refcount. In the ifconfig
down case the entry is still in the state GID_TABLE_ENTRY_VALID, so the
function returns true and cm_alloc_msg() increases the refcount of the
cm.id.
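For completeness, here is a minimal stand-alone paraphrase of that
check, so the two cases are easy to compare. It is not the exact
ib_core code (the real GID cache definitions have many more fields and
the enum values may differ); only the state check itself is taken from
the chain above.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stripped-down paraphrase of ib_core's GID table entry. */
enum gid_table_entry_state {
        GID_TABLE_ENTRY_INVALID,
        GID_TABLE_ENTRY_VALID,
        GID_TABLE_ENTRY_PENDING_DEL,
};

struct ib_gid_table_entry {
        enum gid_table_entry_state state;
};

/* The check at the bottom of the
 * cm_alloc_msg -> rdma_create_ah -> ... -> rdma_read_gid_attr_ndev_rcu
 * call chain. */
static bool is_gid_entry_valid(const struct ib_gid_table_entry *entry)
{
        return entry && entry->state == GID_TABLE_ENTRY_VALID;
}

int main(void)
{
        struct ib_gid_table_entry pending = { .state = GID_TABLE_ENTRY_PENDING_DEL };
        struct ib_gid_table_entry valid = { .state = GID_TABLE_ENTRY_VALID };

        /* ifdown: entry is gone or marked for deletion -> cm_alloc_msg()
         * fails and no extra cm_id reference is taken. */
        printf("ifdown:        %d %d\n",
               is_gid_entry_valid(NULL), is_gid_entry_valid(&pending));

        /* ifconfig down: entry is still VALID -> cm_alloc_msg() succeeds,
         * takes a reference and queues a DREQ on an interface that is
         * going down. */
        printf("ifconfig down: %d\n", is_gid_entry_valid(&valid));
        return 0;
}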
We think one of the questions below points to the problem:

1. Why was cm_process_send_error triggered for only 1 of the 3 sent
DREQs? Why is cm_send_handler not called for the others?

2. When RTRS receives an error and starts closing the connection,
during which rdma_disconnect() is called, shouldn't cm_alloc_msg() fail
in such a case?

The same behavior is observed for RTRS and for NVMeoF targets.

>
> Thanks
>
> >
> > Regards
> > -Haris