On 7/14/22 8:59 PM, Bernard Metzler wrote: >> -----Original Message----- >> From: Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx> >> Sent: Thursday, 14 July 2022 03:31 >> To: jgg@xxxxxxxx; leon@xxxxxxxxxx; Bernard Metzler <BMT@xxxxxxxxxxxxxx> >> Cc: linux-rdma@xxxxxxxxxxxxxxx; chengyou@xxxxxxxxxxxxxxxxx >> Subject: [EXTERNAL] [PATCH for-next] RDMA/siw: Fix duplicated reported >> IW_CM_EVENT_CONNECT_REPLY event >> >> If siw_recv_mpa_rr returns -EAGAIN, it means that the MPA reply hasn't >> been received completely, and should not report IW_CM_EVENT_CONNECT_REPLY >> in this case. This may trigger a call trace in iw_cm. A simple way to >> trigger this: > > Great, thanks! I obviously did never hit an incomplete > MPA hdr. Please make another change to fix it correctly, > as suggested below. > > > case of an incomplete >> server: ib_send_lat >> client: ib_send_lat -R <server_ip> >> >> The call trace looks like this: >> >> kernel BUG at drivers/infiniband/core/iwcm.c:894! >> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI >> <...> >> Workqueue: iw_cm_wq cm_work_handler [iw_cm] >> Call Trace: >> <TASK> >> cm_work_handler+0x1dd/0x370 [iw_cm] >> process_one_work+0x1e2/0x3b0 >> worker_thread+0x49/0x2e0 >> ? rescuer_thread+0x370/0x370 >> kthread+0xe5/0x110 >> ? kthread_complete_and_exit+0x20/0x20 >> ret_from_fork+0x1f/0x30 >> </TASK> >> >> Signed-off-by: Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx> >> --- >> drivers/infiniband/sw/siw/siw_cm.c | 7 ++++--- >> 1 file changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/infiniband/sw/siw/siw_cm.c >> b/drivers/infiniband/sw/siw/siw_cm.c >> index 17f34d584cd9..f88d2971c2c6 100644 >> --- a/drivers/infiniband/sw/siw/siw_cm.c >> +++ b/drivers/infiniband/sw/siw/siw_cm.c >> @@ -725,11 +725,11 @@ static int siw_proc_mpareply(struct siw_cep *cep) >> enum mpa_v2_ctrl mpa_p2p_mode = MPA_V2_RDMA_NO_RTR; >> >> rv = siw_recv_mpa_rr(cep); >> - if (rv != -EAGAIN) >> - siw_cancel_mpatimer(cep); >> if (rv) >> goto out_err; >> >> + siw_cancel_mpatimer(cep); >> + > > Cancel the MPA timer only if we have a > real error. -EAGAIN translates to just > further waiting. So best to add the timer > cancellation to the error bailout section. > >> rep = &cep->mpa.hdr; >> >> if (__mpa_rr_revision(rep->params.bits) > MPA_REVISION_2) { >> @@ -895,7 +895,8 @@ static int siw_proc_mpareply(struct siw_cep *cep) >> } >> >> out_err: >> - siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, -EINVAL); >> + if (rv != -EAGAIN) > { > cancel MPA timer here. Indeed we do not need it here, because when siw_proc_mpareply returns error but not -EAGAIN, the release_cep will be set in the caller (siw_cm_work_handler), and siw_cancel_mpatimer will be called in the error handle flow. I think this is better, because the error handle is more unified. How do you think? Thanks, Cheng Xu > siw_cancel_mpatimer(cep); >> + siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, -EINVAL); > } >> >> return rv; >> } >> -- >> 2.37.0 >