Re: [PATCH rdma-next 09/15] RDMA: Clean destroy CQ in drivers do not return errors

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, May 29, 2019 at 07:39:45PM +0000, Saleem, Shiraz wrote:
> > Subject: [PATCH rdma-next 09/15] RDMA: Clean destroy CQ in drivers do not
> > return errors
> >
> > From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> >
> > Like all other destroy commands, .destroy_cq() call is not supposed to fail. In all
> > flows, the attempt to return earlier caused to memory leaks.
> >
> > This patch converts .destroy_cq() to do not return any errors.
> >
> > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > ---
> >  drivers/infiniband/core/cq.c                  |  5 +-
> >  drivers/infiniband/core/verbs.c               |  3 +-
> >  drivers/infiniband/hw/bnxt_re/ib_verbs.c      | 13 ++---
> >  drivers/infiniband/hw/bnxt_re/ib_verbs.h      |  2 +-
> >  drivers/infiniband/hw/cxgb3/cxio_hal.c        |  6 +--
> >  drivers/infiniband/hw/cxgb3/cxio_hal.h        |  2 +-
> >  drivers/infiniband/hw/cxgb3/iwch_provider.c   |  8 +---
> >  drivers/infiniband/hw/cxgb4/cq.c              | 13 ++---
> >  drivers/infiniband/hw/cxgb4/iw_cxgb4.h        |  2 +-
> >  drivers/infiniband/hw/efa/efa.h               |  2 +-
> >  drivers/infiniband/hw/efa/efa_verbs.c         |  9 +---
> >  drivers/infiniband/hw/hns/hns_roce_cq.c       | 48 +++++++++----------
> >  drivers/infiniband/hw/hns/hns_roce_device.h   |  4 +-
> >  drivers/infiniband/hw/hns/hns_roce_hw_v1.c    | 14 ++----
> >  drivers/infiniband/hw/i40iw/i40iw_verbs.c     |  3 +-
> >  drivers/infiniband/hw/mlx4/cq.c               |  4 +-
> >  drivers/infiniband/hw/mlx4/mlx4_ib.h          |  2 +-
> >  drivers/infiniband/hw/mlx5/cq.c               |  4 +-
> >  drivers/infiniband/hw/mlx5/mlx5_ib.h          |  2 +-
> >  drivers/infiniband/hw/mthca/mthca_provider.c  |  4 +-
> >  drivers/infiniband/hw/nes/nes_utils.c         |  4 +-
> >  drivers/infiniband/hw/nes/nes_verbs.c         | 30 ++++--------
> >  drivers/infiniband/hw/ocrdma/ocrdma_hw.c      |  8 ++--
> >  drivers/infiniband/hw/ocrdma/ocrdma_hw.h      |  2 +-
> >  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  6 +--
> >  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h   |  2 +-
> >  drivers/infiniband/hw/qedr/verbs.c            | 20 +-------
> >  drivers/infiniband/hw/qedr/verbs.h            |  2 +-
> >  drivers/infiniband/hw/usnic/usnic_ib_verbs.c  |  4 +-
> > drivers/infiniband/hw/usnic/usnic_ib_verbs.h  |  2 +-
> > drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c  |  6 +--
> >  .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  2 +-
> >  drivers/infiniband/sw/rdmavt/cq.c             |  6 +--
> >  drivers/infiniband/sw/rdmavt/cq.h             |  2 +-
> >  drivers/infiniband/sw/rxe/rxe_verbs.c         |  3 +-
> >  include/rdma/ib_verbs.h                       |  2 +-
> >  36 files changed, 82 insertions(+), 169 deletions(-)
> >
> [...]
>
> > diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > index 2c3685faa57a..a3c65e45a2bc 100644
> > --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > @@ -2517,9 +2517,8 @@ int bnxt_re_post_recv(struct ib_qp *ib_qp, const struct
> > ib_recv_wr *wr,  }
> >
> >  /* Completion Queues */
> > -int bnxt_re_destroy_cq(struct ib_cq *ib_cq, struct ib_udata *udata)
> > +void bnxt_re_destroy_cq(struct ib_cq *ib_cq, struct ib_udata *udata)
> >  {
> > -	int rc;
> >  	struct bnxt_re_cq *cq;
> >  	struct bnxt_qplib_nq *nq;
> >  	struct bnxt_re_dev *rdev;
> > @@ -2528,20 +2527,14 @@ int bnxt_re_destroy_cq(struct ib_cq *ib_cq, struct
> > ib_udata *udata)
> >  	rdev = cq->rdev;
> >  	nq = cq->qplib_cq.nq;
> >
> > -	rc = bnxt_qplib_destroy_cq(&rdev->qplib_res, &cq->qplib_cq);
> > -	if (rc) {
> > -		dev_err(rdev_to_dev(rdev), "Failed to destroy HW CQ");
> > -		return rc;
> > -	}
> > -	if (!IS_ERR_OR_NULL(cq->umem))
>
> Are we losing any useful debug data here? Is there a device error print to track the adminQ cmd failure?
> Same applies to other call sites where your removing prints.

In case of failure, we will see debug errors from bnxt_qplib_rcfw_send_message().
In addition, even before this patch, there is one place of bnxt_qplib_destroy_cq()
without checks of return value.

>
> Sorry if this is a naïve question but can we now run into a situation where the adminQ cmd failed for some reason
> (maybe FW didn't even issue it to HW), and the HW continues to write to the CQ which you have just freed in core?
> This is not to say leaking kernel memory is correct...but are we risking corruption?

It is not different from any other places, there is no silver bullet
proof against broken HW.

>
> > +	bnxt_qplib_destroy_cq(&rdev->qplib_res, &cq->qplib_cq);
> > +	if (!cq->umem)
> >  		ib_umem_release(cq->umem);
> >
> >  	atomic_dec(&rdev->cq_count);
> >  	nq->budget--;
> >  	kfree(cq->cql);
> >  	kfree(cq);
> > -
> > -	return 0;
> >  }
> >
>
> [...]
>
> > a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
> > index 3d7bde19838e..d2a230d6c0e7 100644
> > --- a/drivers/infiniband/hw/qedr/verbs.c
> > +++ b/drivers/infiniband/hw/qedr/verbs.c
> > @@ -962,14 +962,13 @@ int qedr_resize_cq(struct ib_cq *ibcq, int new_cnt,
> > struct ib_udata *udata)
> >  #define QEDR_DESTROY_CQ_MAX_ITERATIONS		(10)
> >  #define QEDR_DESTROY_CQ_ITER_DURATION		(10)
> >
> > -int qedr_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
> > +void qedr_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
> >  {
> >  	struct qedr_dev *dev = get_qedr_dev(ibcq->device);
> >  	struct qed_rdma_destroy_cq_out_params oparams;
> >  	struct qed_rdma_destroy_cq_in_params iparams;
> >  	struct qedr_cq *cq = get_qedr_cq(ibcq);
> >  	int iter;
> > -	int rc;
> >
> >  	DP_DEBUG(dev, QEDR_MSG_CQ, "destroy cq %p (icid=%d)\n", cq, cq-
> > >icid);
> >
> > @@ -980,10 +979,7 @@ int qedr_destroy_cq(struct ib_cq *ibcq, struct ib_udata
> > *udata)
> >  		goto done;
> >
> >  	iparams.icid = cq->icid;
> > -	rc = dev->ops->rdma_destroy_cq(dev->rdma_ctx, &iparams, &oparams);
> > -	if (rc)
> > -		return rc;
> > -
> > +	dev->ops->rdma_destroy_cq(dev->rdma_ctx, &iparams, &oparams);
> >  	dev->ops->common->chain_free(dev->cdev, &cq->pbl);
> >
> >  	if (udata) {
> > @@ -1014,9 +1010,6 @@ int qedr_destroy_cq(struct ib_cq *ibcq, struct ib_udata
> > *udata)
> >  		iter--;
> >  	}
> >
> > -	if (oparams.num_cq_notif != cq->cnq_notif)
> > -		goto err;
> > -
>
> Do you want to keep the check and the DP_ERR print for debug?

I relied on the comment above which says that such scenario is possible
in disaster and assumed that qedr will print enough information in other
places. Anyway, this "disaster" won't be related to CQ, but to some
internal HW failure and this print won't help too much.

>
> >  	/* Note that we don't need to have explicit code to wait for the
> >  	 * completion of the event handler because it is invoked from the EQ.
> >  	 * Since the destroy CQ ramrod has also been received on the EQ we can
> > @@ -1026,15 +1019,6 @@ int qedr_destroy_cq(struct ib_cq *ibcq, struct
> > ib_udata *udata)
> >  	cq->sig = ~cq->sig;
> >
> >  	kfree(cq);
> > -
> > -	return 0;
> > -
> > -err:
> > -	DP_ERR(dev,
> > -	       "CQ %p (icid=%d) not freed, expecting %d ints but got %d ints\n",
> > -	       cq, cq->icid, oparams.num_cq_notif, cq->cnq_notif);
> > -
> > -	return -EINVAL;
> >  }
> >
>



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux