On Mon, Jan 23, 2017 at 12:10 PM, Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx> wrote:
> On Mon, 2017-01-23 at 12:01 -0700, Robert LeBlanc wrote:
>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>> index 8368764..ed36748 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -2089,3 +2089,19 @@ void ib_drain_qp(struct ib_qp *qp)
>>  	ib_drain_rq(qp);
>>  }
>>  EXPORT_SYMBOL(ib_drain_qp);
>> +
>> +void ib_reset_sq(struct ib_qp *qp)
>> +{
>> +	struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
>> +	int ret;
>> +
>> +	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
>> +}
>> +EXPORT_SYMBOL(ib_reset_sq);
>> +
>> +void ib_reset_qp(struct ib_qp *qp)
>> +{
>> +	printk("ib_reset_qp calling ib_reset_sq.\n");
>> +	ib_reset_sq(qp);
>> +}
>> +EXPORT_SYMBOL(ib_reset_qp);
>
> These are one liners. Is it really worth to add one-line functions to the
> IB core?
>
>> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
>> b/drivers/infiniband/ulp/isert/ib_isert.c
>> index 6dd43f6..619dbc7 100644
>> --- a/drivers/infiniband/ulp/isert/ib_isert.c
>> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
>> @@ -2595,10 +2595,9 @@ static void isert_wait_conn(struct iscsi_conn *conn)
>>  	isert_conn_terminate(isert_conn);
>>  	mutex_unlock(&isert_conn->mutex);
>>
>> -	ib_drain_qp(isert_conn->qp);
>> +	ib_reset_qp(isert_conn->qp);
>>  	isert_put_unsol_pending_cmds(conn);
>> -	isert_wait4cmds(conn);
>> -	isert_wait4logout(isert_conn);
>> +	cancel_work_sync(&isert_conn->release_work);
>>
>>  	queue_work(isert_release_wq, &isert_conn->release_work);
>>  }
>
> Sorry but leaving out the ib_drain_qp() and isert_wait*() calls seems wrong
> to me. Additionally, resetting the send queue should not be needed since the
> iSER target driver should guarantee that no new WRs will be queued on the
> send queue after isert_wait_conn() is called.
>
> Seeing this patch makes me wonder whether this behavior can be reproduced
> with any other HBA than ConnectX-4 Lx? Is this a software or a firmware bug?
>
> Thanks,
>
> Bart.

Yes, it all feels wrong, which is why I need some guidance. The
backtraces of the D state processes on the InfiniBand and RoCE targets
are identical. I believe that there is an additional bug in the
ConnectX-4 Lx firmware that makes the D state problem on the target
much easier to trigger. The release notes seem to indicate that the
firmware bug may be fixed in 14.17.2020, but there is no SuperMicro
version yet, and I can't match up the board IDs with enough confidence
to flash the Mellanox firmware to test.

In summary, I think there are two bugs: one in iSER that causes the
target to go into D state when something funky happens with the
connection, on both InfiniBand and RoCE, and a second one in the
ConnectX-4 Lx firmware that easily triggers the first, more critical
issue.

Is there some way to inspect what is in the queue pair to see what may
be blocking things?

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
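
P.S. For anyone who wants to compare D state backtraces on their own
target, here is a rough userspace sketch of how the stuck tasks could be
listed. It is not part of the patch, and the helper names stat_state()
and print_d_state_tasks() are made up for illustration. It only assumes
the documented /proc/<pid>/stat layout (pid, comm in parentheses, then
the state letter) and that /proc/<pid>/stack is readable as root on a
kernel built with stack tracing.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the state letter from one /proc/<pid>/stat
 * line, or 0 if the line cannot be parsed. The comm field is wrapped in
 * parentheses and may itself contain spaces or ')', so look for the last
 * ')' instead of splitting on spaces.
 */
static char stat_state(const char *line)
{
	const char *p = strrchr(line, ')');

	if (!p || p[1] != ' ' || p[2] == '\0')
		return 0;
	return p[2];
}

/*
 * Hypothetical helper: print every task that is currently in
 * uninterruptible sleep (state 'D'). Returns the number of such tasks,
 * or -1 if /proc cannot be opened.
 */
static int print_d_state_tasks(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	int count = 0;

	if (!proc)
		return -1;

	while ((de = readdir(proc)) != NULL) {
		char path[272], line[512];
		FILE *f;

		/* Only numeric directory names are tasks. */
		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* task exited while we were scanning */

		if (fgets(line, sizeof(line), f) && stat_state(line) == 'D') {
			printf("pid %s is in D state; see /proc/%s/stack\n",
			       de->d_name, de->d_name);
			count++;
		}
		fclose(f);
	}
	closedir(proc);
	return count;
}
```

Calling print_d_state_tasks() from a small main() as root prints one
line per blocked task, and the matching /proc/<pid>/stack file then
holds the kernel backtrace. This only shows where the tasks are
sleeping; it does not show what is sitting in the queue pair.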