Re: [PATCH rdma-next v1 05/10] RDMA: Restore ability to fail on SRQ destroy

Jason Gunthorpe <jgg@xxxxxxxxxx> · Thu, 3 Sep 2020 09:22:56 -0300

On Thu, Sep 03, 2020 at 08:28:26AM +0300, Leon Romanovsky wrote:
> On Wed, Sep 02, 2020 at 09:18:27PM -0300, Jason Gunthorpe wrote:
> > On Sun, Aug 30, 2020 at 11:40:05AM +0300, Leon Romanovsky wrote:
> >
> > > -void mlx5_ib_destroy_srq(struct ib_srq *srq, struct ib_udata *udata)
> > > +int mlx5_ib_destroy_srq(struct ib_srq *srq, struct ib_udata *udata)
> > >  {
> > >  	struct mlx5_ib_dev *dev = to_mdev(srq->device);
> > >  	struct mlx5_ib_srq *msrq = to_msrq(srq);
> > > +	int ret;
> > > +
> > > +	ret = mlx5_cmd_destroy_srq(dev, &msrq->msrq);
> > > +	if (ret && udata)
> > > +		return ret;
> > >
> > > -	mlx5_cmd_destroy_srq(dev, &msrq->msrq);
> > > -
> > > -	if (srq->uobject) {
> > > -		mlx5_ib_db_unmap_user(
> > > -			rdma_udata_to_drv_context(
> > > -				udata,
> > > -				struct mlx5_ib_ucontext,
> > > -				ibucontext),
> > > -			&msrq->db);
> > > -		ib_umem_release(msrq->umem);
> > > -	} else {
> > > -		destroy_srq_kernel(dev, msrq);
> > > +	if (udata) {
> > > +		destroy_srq_user(srq->pd, msrq, udata);
> > > +		return 0;
> > >  	}
> > > +
> > > +	/* We are cleaning kernel resources anyway */
> > > +	destroy_srq_kernel(dev, msrq);
> >
> > Oh, and this isn't right.. If we are going to leak things then we have
> > to leak anything exposed for DMA as well, eg the fragbuf under the SRQ
> > has to be leaked.
> 
> We are leaking for ULPs only, from their perspective everything was
> released and WARN_ON() will be the sign of the problem.

If we are going to add back in error handling, then it needs to be
done right: there is no difference between kernel and user, everything
should be leaked.

> > If the HW can't guarantee it stopped doing DMA then we can't return
> > memory under potentially active DMA back to the system.
> 
> ULPs are supposed to guarantee that all operations stopped.

A ULP should never trigger this; only broken HW can cause this kind of
problem.
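
To make the rule concrete, here is a rough userspace sketch of what I
mean, not driver code. Every name in it (fake_srq, fake_hw_destroy,
fake_destroy_srq, the hw_broken flag) is made up for illustration and
none of it is the mlx5 implementation. The only point is the ordering:
memory exposed for DMA is freed only after the HW destroy succeeded,
and the path is the same for kernel and user owners.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver SRQ object; not the mlx5 layout. */
struct fake_srq {
	void *dma_buf;		/* memory the device may still DMA into */
	bool hw_destroyed;	/* set once the device confirms teardown */
};

/*
 * Hypothetical HW command. hw_broken simulates a device that fails
 * the destroy command and so cannot promise DMA has stopped.
 */
int fake_hw_destroy(struct fake_srq *srq, bool hw_broken)
{
	if (hw_broken)
		return -EIO;
	srq->hw_destroyed = true;
	return 0;
}

/* Same path for kernel and user owners: free only after HW success. */
int fake_destroy_srq(struct fake_srq *srq, bool hw_broken)
{
	int ret;

	assert(srq);
	ret = fake_hw_destroy(srq, hw_broken);
	if (ret)
		return ret;	/* leak dma_buf on purpose: DMA may be active */

	free(srq->dma_buf);
	srq->dma_buf = NULL;
	return 0;
}
```

On failure the caller gets the error code and the buffer stays
allocated, so nothing the device might still write to goes back to the
system.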

> I don't know, all those years we relied on the ULPs to do destroy
> properly and it worked well. I didn't hear any complaint from the field
> that HW destroy failed in proper ULP flow.
> 
> It looks like over-engineering to me.

Given mlx5 already has the fatal error handling, it seems a reasonable
way to re-introduce the error code without just declaring that drivers
are buggy to use it.

Jason