On Fri, 2019-08-09 at 17:41 +0800, Lijun Ou wrote: > From: Yangyang Li <liyangyang20@xxxxxxxxxx> > > Even if no response from hardware, make sure that qp related > resources are completely released. > > Signed-off-by: Yangyang Li <liyangyang20@xxxxxxxxxx> > --- > drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 12 ++++-------- > 1 file changed, 4 insertions(+), 8 deletions(-) > > diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c > b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c > index 7a14f0b..0409851 100644 > --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c > +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c > @@ -4562,16 +4562,14 @@ static int > hns_roce_v2_destroy_qp_common(struct hns_roce_dev *hr_dev, > { > struct hns_roce_cq *send_cq, *recv_cq; > struct ib_device *ibdev = &hr_dev->ib_dev; > - int ret; > + int ret = 0; > > if (hr_qp->ibqp.qp_type == IB_QPT_RC && hr_qp->state != > IB_QPS_RESET) { > /* Modify qp to reset before destroying qp */ > ret = hns_roce_v2_modify_qp(&hr_qp->ibqp, NULL, 0, > hr_qp->state, IB_QPS_RESET); > - if (ret) { > + if (ret) > ibdev_err(ibdev, "modify QP to Reset > failed.\n"); > - return ret; > - } > } > > send_cq = to_hr_cq(hr_qp->ibqp.send_cq); > @@ -4627,7 +4625,7 @@ static int hns_roce_v2_destroy_qp_common(struct > hns_roce_dev *hr_dev, > kfree(hr_qp->rq_inl_buf.wqe_list); > } > > - return 0; > + return ret; > } > > static int hns_roce_v2_destroy_qp(struct ib_qp *ibqp, struct ib_udata > *udata) > @@ -4637,11 +4635,9 @@ static int hns_roce_v2_destroy_qp(struct ib_qp > *ibqp, struct ib_udata *udata) > int ret; > > ret = hns_roce_v2_destroy_qp_common(hr_dev, hr_qp, udata); > - if (ret) { > + if (ret) > ibdev_err(&hr_dev->ib_dev, "Destroy qp 0x%06lx > failed(%d)\n", > hr_qp->qpn, ret); > - return ret; > - } > > if (hr_qp->ibqp.qp_type == IB_QPT_GSI) > kfree(hr_to_hr_sqp(hr_qp)); I don't know your hardware, but this patch sounds wrong/dangerous to me. As long as the resources this card might access are allocated by the kernel, you can't get random data corruption by the card writing to memory used elsewhere in the kernel. So if your card is not responding to your requests to free the resources, it would seem safer to leak those resources permanently than to free them and risk the card coming back to life long enough to corrupt memory reallocated to some other task. Only if you can guarantee me that there is no way your commands to the card will fail and then the card start working again later would I consider this patch safe. And if it's possible for the card to hang like this, should that be triggering a reset of the device? -- Doug Ledford <dledford@xxxxxxxxxx> GPG KeyID: B826A3330E572FDD Fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
Attachment:
signature.asc
Description: This is a digitally signed message part