On Wed, 2018-01-03 at 07:13 +0200, Moni Shoua wrote:
> > Does this perhaps mean that the rxe_qp structure can be freed while rxe_do_task()
> > is in progress? Please note that the ib_srpt driver only destroys a QP
> > (srpt_destroy_ch_ib() call in srpt_release_channel_work()) after all SCSI command
> > processing has finished (transport_deregister_session()).
>
> If I understand right you say that the system is hung when trying to
> take a lock in rxe_do_task() (line 89). Is that right?
> Anyway, it's possible that you hit a bug related to destroying a QP.

Hello Moni,

The issues I had reported may be unrelated. BTW, this is what I saw appearing
in the system log a few minutes ago:

Jan 3 13:03:56 ubuntu-vm kernel: ib_srpt:srpt_close_ch: ib_srpt 192.168.122.76-18: queued zerolength write
Jan 3 13:03:56 ubuntu-vm kernel: rdma_rxe:rxe_completer: rdma_rxe: rxe_completer(): qp valid 1, state ERROR
[ ... ]
Jan 3 13:04:09 ubuntu-vm kernel: ib_srpt:srpt_disconnect_ch_sync: ib_srpt ch 192.168.122.76-18 state 3
[ ... ]
Jan 3 13:04:14 ubuntu-vm kernel: ib_srpt srpt_disconnect_ch_sync(192.168.122.76-18 state 3): still waiting ...

In other words, the ib_srpt driver had queued a zero-length write and changed
the QP state to ERROR, but no completion was queued for that zero-length write.
The rdma_rxe log message was generated by the following code:

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 6cdc40ed8a9f..f6c40edbddc6 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -550,6 +550,9 @@ int rxe_completer(void *arg)
 	if (!qp->valid || qp->req.state == QP_STATE_ERROR ||
 	    qp->req.state == QP_STATE_RESET) {
+		pr_debug("rxe_completer(): qp valid %d, state %s\n",
+			 qp->valid, qp->req.state == QP_STATE_ERROR ? "ERROR" :
+			 qp->req.state == QP_STATE_RESET ? "RESET" : "(?)");
 		rxe_drain_resp_pkts(qp, qp->valid &&
 				    qp->req.state == QP_STATE_ERROR);
 		goto exit;

Bart.